how to create a DF from a DF based on a condition - python

My current DF looks like this
Combinations Count
1 ('IDLY', 'VADA') 3734
6 ('DOSA', 'IDLY') 2020
9 ('CHAPPATHI', 'DOSA') 1297
10 ('IDLY', 'POORI') 1297
11 ('COFFEE', 'TEA') 1179
13 ('DOSA', 'VADA') 1141
15 ('CHAPPATHI', 'IDLY') 1070
16 ('COFFEE', 'SAMOSA') 1061
17 ('COFFEE', 'IDLY') 1016
18 ('POORI', 'VADA') 1008
Let's say I filter the data frame above by the keyword 'DOSA'. I get the output below:
Combinations Count
6 ('DOSA', 'IDLY') 2020
9 ('CHAPPATHI', 'DOSA') 1297
13 ('DOSA', 'VADA') 1141
But I would like the output to look like the df below, which omits the filter keyword because it is common to every row:
Combinations Count
6 IDLY 2020
9 CHAPPATHI 1297
13 VADA 1141
What concept of pandas needs to be used here? How can this be achieved?

In general, it's not ideal to store lists, tuples, sets, etc. inside a dataframe. It's better to have multiple records for each instance when needed.
You can use explode to turn Combinations into this form and filter on that:
keyword = 'DOSA'
s = df.explode('Combinations')
s.loc[s.Combinations.eq(keyword).groupby(level=0).transform('any') & s.Combinations.ne(keyword)]
Or chain the two commands with .loc[lambda ]:
(df.explode('Combinations')
.loc[lambda x: x.Combinations.ne(keyword) &
x.Combinations.eq(keyword).groupby(level=0).transform('any')]
)
Output:
Combinations Count
6 IDLY 2020
9 CHAPPATHI 1297
13 VADA 1141
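For anyone who wants to run the explode approach end to end, here is a minimal self-contained sketch. The sample frame is a hand-typed subset of the question's data, and the names has_keyword and result are only illustrative.

```python
import pandas as pd

# Hand-typed subset of the question's data; tuples stored in 'Combinations'.
df = pd.DataFrame(
    {
        "Combinations": [("IDLY", "VADA"), ("DOSA", "IDLY"),
                         ("CHAPPATHI", "DOSA"), ("DOSA", "VADA")],
        "Count": [3734, 2020, 1297, 1141],
    },
    index=[1, 6, 9, 13],
)

keyword = "DOSA"

# One row per tuple element; the original index label is repeated per element.
s = df.explode("Combinations")

# True for every row whose original tuple contained the keyword.
has_keyword = s["Combinations"].eq(keyword).groupby(level=0).transform("any")

# Keep only the partner items, dropping the keyword rows themselves.
result = s[has_keyword & s["Combinations"].ne(keyword)]
print(result)
```

Grouping on level=0 works here because explode keeps the original index, so both elements of a tuple share one label.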

What I would do:
x=df.explode('Combinations')
x=x.loc[x.index[x.Combinations=='DOSA']].query('Combinations !="DOSA"')
x
Combinations Count
6 IDLY 2020
9 CHAPPATHI 1297
13 VADA 1141

You can also try creating a DataFrame as a reference, then mask where the keyword matches and use stack to drop the NaN values:
keyword = 'DOSA'
m = pd.DataFrame(df['Combinations'].tolist(),index=df.index)
c = m.eq(keyword).any(axis=1)
df[c].assign(Combinations=
m[c].where(m[c].ne(keyword)).stack().droplevel(1))
Combinations Count
6 IDLY 2020
9 CHAPPATHI 1297
13 VADA 1141
If the Combinations column holds strings rather than tuples, you can convert them to tuples first:
import ast
df['Combinations'] = df['Combinations'].apply(ast.literal_eval)

d = df[df['Combinations'].transform(lambda x: 'DOSA' in x)].copy()
d['Combinations'] = d['Combinations'].apply(lambda x: set(x).difference(['DOSA']).pop())
print(d)
Prints:
ID Combinations Count
1 6 IDLY 2020
2 9 CHAPPATHI 1297
5 13 VADA 1141

Related

Replace every nth row in df1 with every row from df 2

(Absolute beginner here)
The following code should replace every 9th row of the template df with EVERY row of the data df. However it replaces every 9th row of template with every 9th row of data.
template.iloc[::9, 2] = data['Question (en)']
template.iloc[::9, 3] = data['Correct Answer']
template.iloc[::9, 4] = data['Incorrect Answer 1']
template.iloc[::9, 5] = data['Incorrect Answer 2']
Thank you for your help

The source of the problem with your code is that the first step of
any operation on two DataFrames is their alignment by index.
To avoid this step, take the underlying NumPy array from one of the DataFrames by invoking values.
Since a NumPy array has no index, pandas can't perform this alignment.
Further corrections:
take from the second DataFrame only as many rows as are needed,
take only those columns that are to be saved in the target DataFrame,
perform the whole update "in one go" (see the code below).
To create both source test DataFrames, I defined the following function:
def getTestDf(nRows : int, tt : str, valShift=0):
    qn = np.array(list(map(lambda i: tt + str(i),np.arange(nRows, dtype=int))))
    ans = np.arange(nRows * 3, dtype=int).reshape((-1, 3)) + valShift
    return pd.concat([pd.DataFrame({'Question (en)' : qn}), pd.DataFrame(ans,
        columns=['Correct Answer', 'Incorrect Answer 1', 'Incorrect Answer 2'])], axis=1)
and called it:
template = getTestDf(80, 'Question_')
data = getTestDf(9, 'New question ', 1000)
Note that after I created template I counted that just 9 rows in data
are needed, so I created data with just 9 rows.
This way the initial part of template contains:
Question (en) Correct Answer Incorrect Answer 1 Incorrect Answer 2
0 Question_0 0 1 2
1 Question_1 3 4 5
2 Question_2 6 7 8
3 Question_3 9 10 11
4 Question_4 12 13 14
...
and data (in full):
Question (en) Correct Answer Incorrect Answer 1 Incorrect Answer 2
0 New question 0 1000 1001 1002
1 New question 1 1003 1004 1005
2 New question 2 1006 1007 1008
3 New question 3 1009 1010 1011
4 New question 4 1012 1013 1014
5 New question 5 1015 1016 1017
6 New question 6 1018 1019 1020
7 New question 7 1021 1022 1023
8 New question 8 1024 1025 1026
Now, to copy selected rows, run just:
template.iloc[::9] = data.values
The initial part of template contains now:
Question (en) Correct Answer Incorrect Answer 1 Incorrect Answer 2
0 New question 0 1000 1001 1002
1 Question_1 3 4 5
2 Question_2 6 7 8
3 Question_3 9 10 11
4 Question_4 12 13 14
5 Question_5 15 16 17
6 Question_6 18 19 20
7 Question_7 21 22 23
8 Question_8 24 25 26
9 New question 1 1003 1004 1005
10 Question_10 30 31 32
11 Question_11 33 34 35
12 Question_12 36 37 38
13 Question_13 39 40 41
14 Question_14 42 43 44
15 Question_15 45 46 47
16 Question_16 48 49 50
17 Question_17 51 52 53
18 New question 2 1006 1007 1008
19 Question_19 57 58 59
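The difference between label alignment and positional assignment can be seen in a small sketch. The frames and the column name q are made up for illustration. I use .loc for the aligned case because I have not checked how every pandas version treats a Series assigned through .iloc.

```python
import pandas as pd

# Hypothetical miniature frames: 6 template rows, 2 replacement rows.
template = pd.DataFrame({"q": [f"old_{i}" for i in range(6)]})
data = pd.DataFrame({"q": ["new_0", "new_1"]})

target_rows = template.index[::3]  # labels 0 and 3

# Assigning a Series aligns on index labels: label 0 exists in data,
# label 3 does not, so row 3 ends up as a missing value.
aligned = template.copy()
aligned.loc[target_rows, "q"] = data["q"]

# Assigning the raw NumPy array is positional: values are used in order.
positional = template.copy()
positional.loc[target_rows, "q"] = data["q"].values

print(aligned["q"].tolist())
print(positional["q"].tolist())
```

Stripping the index with values is what lets the second assignment fill rows 0 and 3 with the first and second replacement in order.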

I am pretty sure that there are simpler/nicer ways, but just off the top of my head:
template_9=template.iloc[::9,0:2].copy()
# outer join
template_9['key'] = 0
data['key'] = 0
template_9.merge(data, how='left') # you don't need left here, but I think it's clearer
template_9.drop('key', axis=1, inplace=True)
template = pd.concat([template,template_9]).drop_duplicates(keep='last')
In case you want to keep the index replace:
template_9.reset_index().merge(data, how='left').set_index('index')
and then you can sort by index in the end.
P.S. I'm assuming column names are the same in both data frames, but it should be straightforward to adapt it anyway.

Why i'm not getting my whole output in the run module?

I'm not getting my whole output, or my column names, on my screen.
import sqlite3
import pandas as pd
hello = sqlite3.connect(r"C:\Users\ravjo\Downloads\Chinook.sqlite")
rs = hello.execute("SELECT * FROM PlaylistTrack INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000")
df = pd.DataFrame(rs.fetchall())
hello.close()
print(df.head())
actual result:
0 1 2 3 4 ... 6 7 8 9 10
0 1 3390 3390 One and the Same 271 ... 23 None 217732 3559040 0.99
1 1 3392 3392 Until We Fall 271 ... 23 None 230758 3766605 0.99
2 1 3393 3393 Original Fire 271 ... 23 None 218916 3577821 0.99
3 1 3394 3394 Broken City 271 ... 23 None 228366 3728955 0.99
4 1 3395 3395 Somedays 271 ... 23 None 213831 3497176 0.99
[5 rows x 11 columns]
expected result:
PlaylistId TrackId TrackId Name AlbumId MediaTypeId \
0 1 3390 3390 One and the Same 271 2
1 1 3392 3392 Until We Fall 271 2
2 1 3393 3393 Original Fire 271 2
3 1 3394 3394 Broken City 271 2
4 1 3395 3395 Somedays 271 2
GenreId Composer Milliseconds Bytes UnitPrice
0 23 None 217732 3559040 0.99
1 23 None 230758 3766605 0.99
2 23 None 218916 3577821 0.99
3 23 None 228366 3728955 0.99
4 23 None 213831 3497176 0.99

The ... in the middle indicates that some of the columns have been omitted from the display. If you want to see all the data, modify the pandas display options with pandas.set_option().
In your case, set display.max_columns to None so that pandas displays an unlimited number of columns. You will also have to read the column names from the database or set them manually.
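As a rough sketch of the two usual ways to get the column names, here is a small example against an in-memory SQLite database. The table and its rows are made up stand-ins for Chinook; cursor.description and pd.read_sql_query are the standard sqlite3 and pandas calls.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for the Chinook database; table and values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Track (TrackId INTEGER, Name TEXT, Milliseconds INTEGER)")
conn.executemany(
    "INSERT INTO Track VALUES (?, ?, ?)",
    [(1, "Short song", 200000), (2, "Long song", 300000)],
)

query = "SELECT * FROM Track WHERE Milliseconds < 250000"

# Option 1: cursor.description has one tuple per column; item 0 is the name.
cur = conn.execute(query)
names = [col[0] for col in cur.description]
df_manual = pd.DataFrame(cur.fetchall(), columns=names)

# Option 2: let pandas run the query and pick up the column names itself.
df_sql = pd.read_sql_query(query, conn)
conn.close()

print(names)
print(list(df_sql.columns))
```

Option 2 is usually less code; option 1 is handy if you already hold a cursor.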

To display all the columns please use below mentioned code snippet.
pd.set_option("display.max_columns",None)

By default, pandas limits the number of rows it displays. However, you can change this as needed. Here is a helper function I use whenever I need to print a full data frame:
def print_full(df):
    import pandas as pd
    pd.set_option('display.max_rows', len(df))
    print(df)
    pd.reset_option('display.max_rows')
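A possible variant of the same helper uses pd.option_context, which restores the options automatically even if print raises. This is only a sketch; the 100-row frame is a made-up example, and the before/after variables are there just to show that the global setting is untouched.

```python
import pandas as pd

def print_full(df):
    """Print df in full; the display options revert when the with-block exits."""
    with pd.option_context("display.max_rows", None, "display.max_columns", None):
        print(df)

before = pd.get_option("display.max_rows")
print_full(pd.DataFrame({"a": range(100)}))
after = pd.get_option("display.max_rows")

# The global setting is left exactly as it was.
print(before == after)
```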

How to place each elements of any column exactlty underneath each other? [duplicate]

This question already has answers here:
Create nice column output in python
(22 answers)
Closed 5 years ago.
I have a problem with the output of my code:
the elements of each column are not placed exactly beneath each other.
My original code is too long, so I reduced it to a simple example;
let me first explain the simple one.
Consider the following simple task:
Write code which receives a natural number r, the number of rows,
and another natural number c, the number of columns,
and then prints all natural numbers
from 1 to rc in r rows and c columns.
So the code will be something like the following:
r = int(input("How many Rows? ")); ## here r stands for number of rows
c = int(input("How many columns? ")); ## here c stands for number of columns
for i in range(1,r+1):
    for j in range (1,c+1):
        print(j+c*(i-1)) ,
    print
and the output is as follows:
How many Rows? 5
How many columns? 6
1 2 3 4 5 6
7 8 9 10 11 12
13 14 15 16 17 18
19 20 21 22 23 24
25 26 27 28 29 30
>>>
or:
How many Rows? 7
How many columns? 3
1 2 3
4 5 6
7 8 9
10 11 12
13 14 15
16 17 18
19 20 21
>>>
What should I do, to get an output like this?
How many Rows? 5
How many columns? 6
 1  2  3  4  5  6
 7  8  9 10 11 12
13 14 15 16 17 18
19 20 21 22 23 24
25 26 27 28 29 30
>>>
or
How many Rows? 7
How many columns? 3
 1  2  3
 4  5  6
 7  8  9
10 11 12
13 14 15
16 17 18
19 20 21
>>>
Now my original code is something like the following:
def function(n):
    R=0;
    something...something...something...
    something...something...something...
    something...something...something...
    something...something...something...
    return(R)
r = int(input("How many Rows? ")); ## here r stands for number of rows
c = int(input("How many columns? ")); ## here c stands for number of columns
for i in range(0,r+1):
    for j in range(0,c+1):
        n=j+c*(i-1);
        r=function(n);
        print (r)
Now for simplicity, suppose that by some by-hand-manipulation we get:
f(1)=function(1)=17, f(2)=235, f(3)=-8;
f(4)=-9641, f(5)=54278249, f(6)=411;
Now when I run the code, the output is as follows:
How many Rows? 2
How many columns? 3
17
235
-8
-9641
54278249
411
>>>
What should I do to get an output like this:
How many Rows? 2
How many columns? 3
   17      235  -8
-9641 54278249 411
>>>
Also note that I did not want to get something like this:
How many Rows? 2
How many columns? 3
17 235 -8
-9641 54278249 411
>>>

Use the rjust method:
r,c = 5,5
for i in range(1,r+1):
    for j in range (1,c+1):
        str_to_printout = str(j+c*(i-1)).rjust(2)
        print(str_to_printout),
    print
Result:
 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
UPD.
As for your last example, let's say f(n) is defined in this way:
def f(n):
    my_dict = {1:17, 2:235, 3:-8, 4:-9641, 5:54278249, 6:411}
    return my_dict.get(n, 0)
Then you can use the following approach:
r,c = 2,3
# data table with elements in string format
data_str = [[str(f(j+c*(i-1))) for j in range (1,c+1)] for i in range(1,r+1)]
# transposed data table and list of max len for every column in data_str
data_str_transposed = [list(i) for i in zip(*data_str)]
max_len_columns = [max(map(len, col)) for col in data_str_transposed]
# printing out
# the string " " before 'join' is a delimiter between columns
for row in data_str:
    print(" ".join(elem.rjust(max_len) for elem, max_len in zip(row, max_len_columns)))
Result:
   17      235  -8
-9641 54278249 411
With r,c = 3,3:
   17      235  -8
-9641 54278249 411
    0        0   0
Note that the indent in each column corresponds to the maximum length in this column, and not in the entire table.

Hope this helps. Please comment if you need any further clarifications.
# result stores the final matrix
# max_len stores the length of the longest element
result, max_len = [], 0
for i in range(1, r + 1):
    temp = []
    for j in range(1, c + 1):
        n = j + c * (i - 1)
        val = function(n)  # renamed from r so the row count is not overwritten
        if len(str(val)) > max_len:
            max_len = len(str(val))
        temp.append(val)
    result.append(temp)
# printing the values separately to apply rjust() to each and every element
for i in result:
    for j in i:
        print(str(j).rjust(max_len), end=' ')
    print()

Adapted from MaximTitarenko's answer:
You first look for the minimum and maximum values, then decide which one is longer and use its length as the width for the rjust(x) call.
import random
r,c = 15,5
m = random.sample(xrange(10000), 100)
length1 = len(str(max(m)))
length2 = len(str(min(m)))
longest = max(length1, length2)
for i in range(r):
    for j in range (c):
        str_to_printout = str(m[i*c+j]).rjust(longest)
        print(str_to_printout),
    print
Example output:
 937 9992 8602 4213 7053
1957 9766 6704 8051 8636
 267  889 1903 8693 5565
8287 7842 6933 2111 9689
3948  428 8894 7522  417
3708 8033  878 4945 2771
6393   35 9065 2193 6797
5430 2720  647 4582 3316
9803 1033 7864  656 4556
6751 6342 4915 5986 6805
9490 2325 5237 8513 8860
8400 1789 2004 4500 2836
8329 4322 6616  132 7198
4715  193 2931 3947 8288
1338 9386 5036 4297 2903

You need to use the string method .rjust
From the documentation:
string.rjust(s, width[, fillchar])
This function right-justifies a string in a field of given width. It returns a string that is at least width characters wide, created by padding the string on the left with the character fillchar (default is a space) until the given width is reached. The string is never truncated.
So we need to calculate the width (in characters) that each number should be padded to. That is pretty simple: the number of digits in the largest number, rows * columns, plus 1 (the +1 adds a one-space gap between columns).
Using this, it becomes quite simple to write the code:
r = int(input("How many Rows? "))
c = int(input("How many columns? "))
width = len(str(r*c)) + 1
for i in range(1,r+1):
    for j in range(1,c+1):
        print str(j+c*(i-1)).rjust(width) ,
    print
which for an r, c of 4, 5 respectively, outputs:
  1   2   3   4   5
  6   7   8   9  10
 11  12  13  14  15
 16  17  18  19  20
Hopefully this helps you out and you can adapt this to other situations yourself!
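Several of the snippets in this thread use Python 2 print syntax. For Python 3, a per-column alignment could be sketched roughly as below. The helper name format_grid and its gap parameter are my own invention, not from any library, and the sample values are the f(1)..f(6) from the question.

```python
def format_grid(values, n_cols, gap=" "):
    """Return text rows with each column right-aligned to its widest cell."""
    cells = [str(v) for v in values]
    # Split the flat list into rows of n_cols cells each.
    rows = [cells[i:i + n_cols] for i in range(0, len(cells), n_cols)]
    # Width of each column is the length of its longest cell (0 if empty).
    widths = [
        max((len(row[k]) for row in rows if k < len(row)), default=0)
        for k in range(n_cols)
    ]
    return [
        gap.join(cell.rjust(widths[k]) for k, cell in enumerate(row))
        for row in rows
    ]

lines = format_grid([17, 235, -8, -9641, 54278249, 411], n_cols=3)
print("\n".join(lines))
```

Computing the width per column rather than once for the whole table keeps narrow columns narrow, which is the layout the question asks for.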

Applying matrix product in specific pandas columns

I have a pandas DataFrame structured in the following way
0 1 2 3 4 5 6 7 8 9
0 42 2012 106 1200 0.112986 -0.647709 -0.303534 31.73 14.80 1096
1 42 2012 106 1200 0.185159 -0.588728 -0.249392 31.74 14.80 1097
2 42 2012 106 1200 0.199910 -0.547780 -0.226356 31.74 14.80 1096
3 42 2012 106 1200 0.065741 -0.796107 -0.099782 31.70 14.81 1097
4 42 2012 106 1200 0.116718 -0.780699 -0.043169 31.66 14.78 1094
5 42 2012 106 1200 0.280035 -0.788511 -0.171763 31.66 14.79 1094
6 42 2012 106 1200 0.311319 -0.663151 -0.271162 31.78 14.79 1094
Columns 4, 5 and 6 are actually the components of a vector. I want to apply a matrix multiplication to these columns, that is, to replace columns 4, 5 and 6 with the vector that results from multiplying the previous vector by a matrix.
What I did was
DC=[[ .. definition of multiplication matrix .. ]]
def rotate(vector):
    return dot(DC, vector)
data[[4,5,6]]=data[[4,5,6]].apply(rotate, axis='columns')
Which I thought should work, but the returned DataFrame is exactly the same as the original.
What am I missing here?

Your code is correct but very slow. You can use the values property to get the ndarray and use dot() to transform all the vectors at once:
import numpy as np
import pandas as pd
DC = np.random.randn(3, 3)
df = pd.DataFrame(np.random.randn(1000, 10))
df2 = df.copy()
df[[4,5,6]] = np.dot(DC, df[[4,5,6]].values.T).T
def rotate(vector):
    return np.dot(DC, vector)
df2[[4,5,6]] = df2[[4,5,6]].apply(rotate, axis='columns')
df.equals(df2)
On my PC, it's about 90x faster.
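To see why the transposes are needed, here is a small sketch that checks the vectorised product against a row-by-row loop. The random DC matrix and the 5-row frame are stand-ins; with one vector per row in V, applying DC to every row is V @ DC.T, which is the same as (DC @ V.T).T.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
DC = rng.standard_normal((3, 3))             # stand-in for the rotation matrix
data = pd.DataFrame(rng.standard_normal((5, 10)))

cols = [4, 5, 6]
vectors = data[cols].to_numpy()              # shape (n_rows, 3), one vector per row

# Row-by-row reference: DC @ v for each vector v.
expected = np.array([DC @ v for v in vectors])

# Vectorised form: multiply all rows at once.
data[cols] = vectors @ DC.T

print(np.allclose(data[cols].to_numpy(), expected))
```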

Pandas dataframe with multiindex column - merge levels

I have a dataframe, grouped, with multiindex columns as below:
import random
import numpy as np
import pandas as pd
codes = ["one","two","three"];
colours = ["black", "white"];
textures = ["soft", "hard"];
N= 100 # length of the dataframe
df = pd.DataFrame({ 'id' : range(1,N+1),
'weeks_elapsed' : [random.choice(range(1,25)) for i in range(1,N+1)],
'code' : [random.choice(codes) for i in range(1,N+1)],
'colour': [random.choice(colours) for i in range(1,N+1)],
'texture': [random.choice(textures) for i in range(1,N+1)],
'size': [random.randint(1,100) for i in range(1,N+1)],
'scaled_size': [random.randint(100,1000) for i in range(1,N+1)]
}, columns= ['id', 'weeks_elapsed', 'code','colour', 'texture', 'size', 'scaled_size'])
grouped = df.groupby(['code', 'colour']).agg( {'size': [np.sum, np.average, np.size, pd.Series.idxmax],'scaled_size': [np.sum, np.average, np.size, pd.Series.idxmax]}).reset_index()
>> grouped
code colour size scaled_size
sum average size idxmax sum average size idxmax
0 one black 1031 60.647059 17 81 185.153944 10.891408 17 47
1 one white 481 37.000000 13 53 204.139249 15.703019 13 53
2 three black 822 48.352941 17 6 123.269405 7.251141 17 31
3 three white 1614 57.642857 28 50 285.638337 10.201369 28 37
4 two black 523 58.111111 9 85 80.908912 8.989879 9 88
5 two white 669 41.812500 16 78 82.098870 5.131179 16 78
[6 rows x 10 columns]
How can I flatten/merge the column index levels as: "Level1|Level2", e.g. size|sum, scaled_size|sum. etc? If this is not possible, is there a way to groupby() as I did above without creating multi-index columns?

There are potentially better, more Pythonic ways to flatten MultiIndex columns.
1. Use map and join with string column headers:
grouped.columns = grouped.columns.map('|'.join).str.strip('|')
print(grouped)
Output:
code colour size|sum size|average size|size size|idxmax \
0 one black 862 53.875000 16 14
1 one white 554 46.166667 12 18
2 three black 842 49.529412 17 90
3 three white 740 56.923077 13 97
4 two black 1541 61.640000 25 50
scaled_size|sum scaled_size|average scaled_size|size scaled_size|idxmax
0 6980 436.250000 16 77
1 6101 508.416667 12 13
2 7889 464.058824 17 64
3 6329 486.846154 13 73
4 12809 512.360000 25 23
2. Use map with format for column headers that have numeric data types.
grouped.columns = grouped.columns.map('{0[0]}|{0[1]}'.format)
Output:
code| colour| size|sum size|average size|size size|idxmax \
0 one black 734 52.428571 14 30
1 one white 1110 65.294118 17 88
2 three black 930 51.666667 18 3
3 three white 1140 51.818182 22 20
4 two black 656 38.588235 17 77
5 two white 704 58.666667 12 17
scaled_size|sum scaled_size|average scaled_size|size scaled_size|idxmax
0 8229 587.785714 14 57
1 8781 516.529412 17 73
2 10743 596.833333 18 21
3 10240 465.454545 22 26
4 9982 587.176471 17 16
5 6537 544.750000 12 49
3. Use list comprehension with f-string for Python 3.6+:
grouped.columns = [f'{i}|{j}' if j != '' else f'{i}' for i,j in grouped.columns]
Output:
code colour size|sum size|average size|size size|idxmax \
0 one black 1003 43.608696 23 76
1 one white 1255 59.761905 21 66
2 three black 777 45.705882 17 39
3 three white 630 52.500000 12 23
4 two black 823 54.866667 15 33
5 two white 491 40.916667 12 64
scaled_size|sum scaled_size|average scaled_size|size scaled_size|idxmax
0 12532 544.869565 23 27
1 13223 629.666667 21 13
2 8615 506.764706 17 92
3 6101 508.416667 12 43
4 7661 510.733333 15 42
5 6143 511.916667 12 49

you could always change the columns:
grouped.columns = ['%s%s' % (a, '|%s' % b if b else '') for a, b in grouped.columns]

Based on Scott Boston's answer,
a small update (it works for columns with 2 or more levels):
temp.columns.map(lambda x: '|'.join([str(i) for i in x]))
Thank you, Boston!

Full credit to suraj's concise answer: https://stackoverflow.com/a/72616083/317797
df.columns = df.columns.map('_'.join)
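On the second part of the question (grouping without creating MultiIndex columns in the first place), one option in pandas 0.25 and later is named aggregation. A minimal sketch with a tiny made-up frame follows; the output column names are arbitrary choices of mine.

```python
import pandas as pd

# Tiny made-up frame with the same kind of columns as in the question.
df = pd.DataFrame(
    {
        "code": ["one", "one", "two", "two"],
        "colour": ["black", "black", "white", "white"],
        "size": [10, 30, 5, 7],
        "scaled_size": [100, 300, 50, 70],
    }
)

# Each keyword becomes a flat output column: name=(input column, aggregation).
named = df.groupby(["code", "colour"]).agg(
    size_sum=("size", "sum"),
    size_mean=("size", "mean"),
    scaled_size_sum=("scaled_size", "sum"),
).reset_index()

# Names containing characters such as "|" can be passed by unpacking a dict.
piped = df.groupby(["code", "colour"]).agg(
    **{"size|sum": ("size", "sum"), "size|idxmax": ("size", "idxmax")}
).reset_index()

print(named)
print(list(piped.columns))
```

This way the columns come out flat, so no renaming step is needed afterwards.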
