extracting upper and lower row if a condition is met - python

I have the following coordinate DataFrame, divided into blocks. Each block has its own pair of columns: seq0_leftend and seq0_rightend, seq1_leftend and seq1_rightend, seq2_leftend and seq2_rightend, seq3_leftend and seq3_rightend, and so on. For each block, whenever a row has negative coordinates, I would like to extract that row together with the row above it and the row below it. An example of my DataFrame file:
seq0_leftend seq0_rightend
0 7 107088
1 107089 108940
2 108941 362759
3 362760 500485
4 500486 509260
5 509261 702736
seq1_leftend seq1_rightend
0 1 106766
1 106767 108619
2 108620 355933
3 355934 488418
4 488419 497151
5 497152 690112
6 690113 700692
7 700693 721993
8 721994 722347
9 722348 946296
10 946297 977714
11 977715 985708
12 -985709 -990725
13 991992 1042023
14 1042024 1259523
15 1259524 1261239
seq2_leftend seq2_rightend
0 1 109407
1 362514 364315
2 109408 362513
3 364450 504968
4 -504969 -515995
5 515996 671291
6 -671295 -682263
7 682264 707010
8 -707011 -709780
9 709781 934501
10 973791 1015417
11 -961703 -973790
12 948955 961702
13 1015418 1069976
14 1069977 1300633
15 -1300634 -1301616
16 1301617 1344821
17 -1515463 -1596433
18 1514459 1515462
19 -1508094 -1514458
20 1346999 1361467
21 -1361468 -1367472
22 1369840 1508093
seq3_leftend seq3_rightend
0 1 112030
1 112031 113882
2 113883 381662
3 381663 519575
4 519576 528317
5 528318 724500
6 724501 735077
7 735078 759456
8 759457 763157
9 763158 996929
10 996931 1034492
11 1034493 1040984
12 -1040985 -1061402
13 1071212 1125426
14 1125427 1353901
15 1353902 1356209
16 1356210 1392818
seq4_leftend seq4_rightend
0 1 105722
1 105723 107575
2 107576 355193
3 355194 487487
4 487488 496220
5 496221 689560
6 689561 700139
7 700140 721438
8 721458 721497
9 721498 947183
10 947184 978601
11 978602 986595
12 -986596 -991612
13 994605 1046245
14 1046247 1264692
15 1264693 1266814
Finally, I want to write a new CSV with the data of interest. An example of the final result I would like:
seq1_leftend seq1_rightend
11 977715 985708
12 -985709 -990725
13 991992 1042023
seq2_leftend seq2_rightend
3 364450 504968
4 -504969 -515995
5 515996 671291
6 -671295 -682263
7 682264 707010
8 -707011 -709780
9 709781 934501
10 973791 1015417
11 -961703 -973790
12 948955 961702
14 1069977 1300633
15 -1300634 -1301616
16 1301617 1344821
17 -1515463 -1596433
18 1514459 1515462
19 -1508094 -1514458
20 1346999 1361467
21 -1361468 -1367472
22 1369840 1508093
seq3_leftend seq3_rightend
11 1034493 1040984
12 -1040985 -1061402
13 1071212 1125426
seq4_leftend seq4_rightend
11 978602 986595
12 -986596 -991612
13 994605 1046245

I assume that you have a list of DataFrames, let's call it src.
To convert a single DataFrame, define the following function:
def findRows(df):
    col = df.iloc[:, 0]
    if col.lt(0).any():
        return df[col.lt(0) | col.shift(1).lt(0) | col.shift(-1).lt(0)]
    else:
        return None
Note that this function starts with reading column 0 from the source
DataFrame, so it is independent of the name of this column.
Then it checks whether any element in this column is < 0.
If found, the returned object is a DataFrame with rows which
contain a value < 0:
either in this element,
or in the previous element,
or in the next element.
If not found, this function returns None (from your expected result
I see that in such a case you don't want even any empty DataFrame).
The first stage is to collect results of this function called on each
DataFrame from src:
result = [ findRows(df) for df in src ]
And the last part is to filter out elements which are None:
result = [df for df in result if df is not None]
To see the result, run:
for df in result:
    print(df)
For src containing first 3 of your DataFrames, I got:
seq1_leftend seq1_rightend
11 977715 985708
12 -985709 -990725
13 991992 1042023
seq2_leftend seq2_rightend
3 364450 504968
4 -504969 -515995
5 515996 671291
6 -671295 -682263
7 682264 707010
8 -707011 -709780
9 709781 934501
10 973791 1015417
11 -961703 -973790
12 948955 961702
14 1069977 1300633
15 -1300634 -1301616
16 1301617 1344821
17 -1515463 -1596433
18 1514459 1515462
19 -1508094 -1514458
20 1346999 1361467
21 -1361468 -1367472
22 1369840 1508093
As you can see, the resulting list contains only results
originating from the second and third source DataFrame.
The first was filtered out, since findRows returned
None from its processing.
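Since the question also asks for a CSV, the steps above can be sketched end to end as follows. This is a minimal sketch rather than the answerer's code: the two miniature blocks, the snake_case name find_rows, and the in-memory buffer standing in for an output file are all assumptions made for illustration.

```python
import io

import pandas as pd


def find_rows(df):
    """Return rows with a negative first column plus their neighbours, or None."""
    col = df.iloc[:, 0]
    neg = col.lt(0)
    if not neg.any():
        return None
    # shift(1) looks at the row above, shift(-1) at the row below
    return df[neg | col.shift(1).lt(0) | col.shift(-1).lt(0)]


# Hypothetical miniature blocks (made-up values in the shape of the question's data)
src = [
    pd.DataFrame({"seq0_leftend": [7, 100, 200], "seq0_rightend": [99, 199, 299]}),
    pd.DataFrame({"seq1_leftend": [1, 10, -20, 30, 40],
                  "seq1_rightend": [9, 19, -29, 39, 49]}),
]

# Keep only the blocks that contain at least one negative coordinate
result = [out for out in map(find_rows, src) if out is not None]

# Write every kept block, header included, one after another.
# A StringIO buffer stands in for a real file here.
buf = io.StringIO()
for block in result:
    block.to_csv(buf, sep=" ")
csv_text = buf.getvalue()
print(csv_text)
```

To write to disk instead, the buffer could be replaced by a file opened with open(path, "w", newline=""), where path is whatever output name you choose.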

Related

How to change a list of synsets to list elements?

I have tried out the following snippet of code for my project:
import pandas as pd
import nltk
from nltk.corpus import wordnet as wn
nltk.download('wordnet')
df=[]
hypo = wn.synset('science.n.01').hyponyms()
hyper = wn.synset('science.n.01').hypernyms()
mero = wn.synset('science.n.01').part_meronyms()
holo = wn.synset('science.n.01').part_holonyms()
ent = wn.synset('science.n.01').entailments()
df = df+hypo+hyper+mero+holo+ent
df_agri_clean = pd.DataFrame(df)
df_agri_clean.columns=["Items"]
print(df_agri_clean)
pd.set_option('display.expand_frame_repr', False)
It has given me this output of a dataframe:
Items
0 Synset('agrobiology.n.01')
1 Synset('agrology.n.01')
2 Synset('agronomy.n.01')
3 Synset('architectonics.n.01')
4 Synset('cognitive_science.n.01')
5 Synset('cryptanalysis.n.01')
6 Synset('information_science.n.01')
7 Synset('linguistics.n.01')
8 Synset('mathematics.n.01')
9 Synset('metallurgy.n.01')
10 Synset('metrology.n.01')
11 Synset('natural_history.n.01')
12 Synset('natural_science.n.01')
13 Synset('nutrition.n.03')
14 Synset('psychology.n.01')
15 Synset('social_science.n.01')
16 Synset('strategics.n.01')
17 Synset('systematics.n.01')
18 Synset('thanatology.n.01')
19 Synset('discipline.n.01')
20 Synset('scientific_theory.n.01')
21 Synset('scientific_knowledge.n.01')
This can be converted to a list by just printing df.
[Synset('agrobiology.n.01'), Synset('agrology.n.01'), Synset('agronomy.n.01'), Synset('architectonics.n.01'), Synset('cognitive_science.n.01'), Synset('cryptanalysis.n.01'), Synset('information_science.n.01'), Synset('linguistics.n.01'), Synset('mathematics.n.01'), Synset('metallurgy.n.01'), Synset('metrology.n.01'), Synset('natural_history.n.01'), Synset('natural_science.n.01'), Synset('nutrition.n.03'), Synset('psychology.n.01'), Synset('social_science.n.01'), Synset('strategics.n.01'), Synset('systematics.n.01'), Synset('thanatology.n.01'), Synset('discipline.n.01'), Synset('scientific_theory.n.01'), Synset('scientific_knowledge.n.01')]
I wish to change every word under "Items" like so :
Synset('agrobiology.n.01') => agrobiology.n.01
or
Synset('agrobiology.n.01') => 'agrobiology'
Any answer associated will be appreciated! Thanks!
To get the name of each item, call its name() method. You could use a list comprehension to update the items as follows:
df_agri_clean['Items'] = [df_agri_clean['Items'][i].name() for i in range(len(df_agri_clean))]
df_agri_clean
The output will be as you expected
Items
0 agrobiology.n.01
1 agrology.n.01
2 agronomy.n.01
3 architectonics.n.01
4 cognitive_science.n.01
5 cryptanalysis.n.01
6 information_science.n.01
7 linguistics.n.01
8 mathematics.n.01
9 metallurgy.n.01
10 metrology.n.01
11 natural_history.n.01
12 natural_science.n.01
13 nutrition.n.03
14 psychology.n.01
15 social_science.n.01
16 strategics.n.01
17 systematics.n.01
18 thanatology.n.01
19 discipline.n.01
20 scientific_theory.n.01
21 scientific_knowledge.n.01
To also strip ".n.01" from the string, you could do the following instead of the line above (it calls name(), so the column must still hold the original Synset objects):
df_agri_clean['Items'] = [df_agri_clean['Items'][i].name().replace('.n.01', '') for i in range(len(df_agri_clean))]
df_agri_clean
Output (just like your second expected output, except that nutrition.n.03 keeps its suffix, because only '.n.01' is replaced)
Items
0 agrobiology
1 agrology
2 agronomy
3 architectonics
4 cognitive_science
5 cryptanalysis
6 information_science
7 linguistics
8 mathematics
9 metallurgy
10 metrology
11 natural_history
12 natural_science
13 nutrition.n.03
14 psychology
15 social_science
16 strategics
17 systematics
18 thanatology
19 discipline
20 scientific_theory
21 scientific_knowledge
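The leftover nutrition.n.03 row suggests stripping the suffix by position rather than by a fixed string. A minimal sketch, assuming the Items column already holds name strings such as those returned by name(); the three sample values and the variable names are chosen here for illustration only:

```python
import pandas as pd

# Hypothetical sample of name strings in the 'lemma.pos.sense' form
items = pd.Series(
    ['agrobiology.n.01', 'nutrition.n.03', 'social_science.n.01'],
    name='Items',
)

# Split off the last two dot-separated fields (part of speech and sense number)
# and keep whatever comes before them
lemmas = items.str.rsplit('.', n=2).str[0]
print(lemmas.tolist())
```

Splitting from the right means the sense number no longer matters, so '.n.03' is handled the same way as '.n.01'.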

Sorting a DataFrame with an index in the form of a list

I have a pandas df that I need to sort based on a fixed order given in a list. The problem that I'm having is that the sort I'm attempting is not moving the data rows in the designated order that I'm expecting from the order of the list. My list and DataFrame (df) looks like this:
months = ['5','6','7','8','9','10','11','12','1','2']
df =
year 1992 1993 1994 1995
month
1 -0.343107 -0.211959 0.437974 -1.219363
2 -0.383353 0.888650 1.054926 0.714846
5 0.057198 1.246682 0.042684 0.275701
6 -0.100018 -0.801554 0.001111 0.382633
7 -0.283815 0.204448 0.350705 0.130652
8 0.042195 -0.433849 -1.481228 -0.236004
9 1.059776 0.875214 0.304638 0.127819
10 -0.328911 -0.256656 1.081157 1.057449
11 -0.488213 -0.957050 -0.813885 1.403822
12 0.973031 -0.246714 0.600157 0.579038
The closest that I have gotten is this -
newdf = pd.DataFrame(df.values, index=list(months))
...but it does not move the rows. This command only adds the months in the index column w/out moving the data.
0 1 2 3
5 -0.343107 -0.211959 0.437974 -1.219363
6 -0.383353 0.888650 1.054926 0.714846
7 0.057198 1.246682 0.042684 0.275701
8 -0.100018 -0.801554 0.001111 0.382633
9 -0.283815 0.204448 0.350705 0.130652
10 0.042195 -0.433849 -1.481228 -0.236004
11 1.059776 0.875214 0.304638 0.127819
12 -0.328911 -0.256656 1.081157 1.057449
1 -0.488213 -0.957050 -0.813885 1.403822
2 0.973031 -0.246714 0.600157 0.579038
I need the result to look like -
year 1992 1993 1994 1995
month
5 0.057198 1.246682 0.042684 0.275701
6 -0.100018 -0.801554 0.001111 0.382633
7 -0.283815 0.204448 0.350705 0.130652
8 0.042195 -0.433849 -1.481228 -0.236004
9 1.059776 0.875214 0.304638 0.127819
10 -0.328911 -0.256656 1.081157 1.057449
11 -0.488213 -0.957050 -0.813885 1.403822
12 0.973031 -0.246714 0.600157 0.579038
1 -0.343107 -0.211959 0.437974 -1.219363
2 -0.383353 0.888650 1.054926 0.714846
Assuming df.index is dtype('int64'), first convert months to integers. Then use loc:
months = [*map(int, months)]
out = df.loc[months]
If df.index is dtype('O'), you can use loc right away, i.e. you don't need the first line.
Output:
year 1992 1993 1994 1995
month
5 0.057198 1.246682 0.042684 0.275701
6 -0.100018 -0.801554 0.001111 0.382633
7 -0.283815 0.204448 0.350705 0.130652
8 0.042195 -0.433849 -1.481228 -0.236004
9 1.059776 0.875214 0.304638 0.127819
10 -0.328911 -0.256656 1.081157 1.057449
11 -0.488213 -0.957050 -0.813885 1.403822
12 0.973031 -0.246714 0.600157 0.579038
1 -0.343107 -0.211959 0.437974 -1.219363
2 -0.383353 0.888650 1.054926 0.714846
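The reordering can be tried on a small stand-in frame. This is only a sketch: the integer values are placeholders rather than the question's data, and the name order is introduced here for the converted list.

```python
import pandas as pd

months = ['5', '6', '7', '8', '9', '10', '11', '12', '1', '2']

# Placeholder data with an integer 'month' index, as in the question
df = pd.DataFrame(
    {1992: range(10), 1993: range(10, 20)},
    index=pd.Index([1, 2, 5, 6, 7, 8, 9, 10, 11, 12], name='month'),
)

# The index holds integers, so the string labels are converted first
order = [int(m) for m in months]

# .loc with a list of labels returns the rows in exactly that order
out = df.loc[order]
print(out)
```

df.reindex(order) gives the same result here; the difference is that reindex inserts NaN rows for labels missing from the index, whereas .loc raises a KeyError.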

What does the last line do?

apb = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
for i in range(26):
    s = apb[i:26] + apb[0:i]
    print("{:2d} {} ".format(i, s))
It is supposed to output each rotation of the alphabet, numbered from 0 to 25.
Sorry, I just started learning Python, so this may seem like a dumb question. I tried googling, but it keeps telling me it has something to do with 2D arrays, and I definitely know that's not the answer I am looking for.
I understand everything until the last line.
What does print("{:2d} {} ".format(i, s)) do?
The format function replaces each {} (a placeholder) with the corresponding argument. {:2d} is similar to the %2d printf format specifier in C: it formats an integer right-aligned in a field at least 2 characters wide. For example, '{:2d}'.format(2) returns ' 2'. If you want, you can use a plain {}, which prints the same values, but the single-digit rows would no longer line up with the two-digit ones. With {:2d}:
 0 ABCDEFGHIJKLMNOPQRSTUVWXYZ
 1 BCDEFGHIJKLMNOPQRSTUVWXYZA
 2 CDEFGHIJKLMNOPQRSTUVWXYZAB
 3 DEFGHIJKLMNOPQRSTUVWXYZABC
 4 EFGHIJKLMNOPQRSTUVWXYZABCD
 5 FGHIJKLMNOPQRSTUVWXYZABCDE
 6 GHIJKLMNOPQRSTUVWXYZABCDEF
 7 HIJKLMNOPQRSTUVWXYZABCDEFG
 8 IJKLMNOPQRSTUVWXYZABCDEFGH
 9 JKLMNOPQRSTUVWXYZABCDEFGHI
10 KLMNOPQRSTUVWXYZABCDEFGHIJ
11 LMNOPQRSTUVWXYZABCDEFGHIJK
12 MNOPQRSTUVWXYZABCDEFGHIJKL
13 NOPQRSTUVWXYZABCDEFGHIJKLM
14 OPQRSTUVWXYZABCDEFGHIJKLMN
15 PQRSTUVWXYZABCDEFGHIJKLMNO
16 QRSTUVWXYZABCDEFGHIJKLMNOP
17 RSTUVWXYZABCDEFGHIJKLMNOPQ
18 STUVWXYZABCDEFGHIJKLMNOPQR
19 TUVWXYZABCDEFGHIJKLMNOPQRS
20 UVWXYZABCDEFGHIJKLMNOPQRST
21 VWXYZABCDEFGHIJKLMNOPQRSTU
22 WXYZABCDEFGHIJKLMNOPQRSTUV
23 XYZABCDEFGHIJKLMNOPQRSTUVW
24 YZABCDEFGHIJKLMNOPQRSTUVWX
25 ZABCDEFGHIJKLMNOPQRSTUVWXY
With {}:
0 ABCDEFGHIJKLMNOPQRSTUVWXYZ
1 BCDEFGHIJKLMNOPQRSTUVWXYZA
2 CDEFGHIJKLMNOPQRSTUVWXYZAB
3 DEFGHIJKLMNOPQRSTUVWXYZABC
4 EFGHIJKLMNOPQRSTUVWXYZABCD
5 FGHIJKLMNOPQRSTUVWXYZABCDE
6 GHIJKLMNOPQRSTUVWXYZABCDEF
7 HIJKLMNOPQRSTUVWXYZABCDEFG
8 IJKLMNOPQRSTUVWXYZABCDEFGH
9 JKLMNOPQRSTUVWXYZABCDEFGHI
10 KLMNOPQRSTUVWXYZABCDEFGHIJ
11 LMNOPQRSTUVWXYZABCDEFGHIJK
12 MNOPQRSTUVWXYZABCDEFGHIJKL
13 NOPQRSTUVWXYZABCDEFGHIJKLM
14 OPQRSTUVWXYZABCDEFGHIJKLMN
15 PQRSTUVWXYZABCDEFGHIJKLMNO
16 QRSTUVWXYZABCDEFGHIJKLMNOP
17 RSTUVWXYZABCDEFGHIJKLMNOPQ
18 STUVWXYZABCDEFGHIJKLMNOPQR
19 TUVWXYZABCDEFGHIJKLMNOPQRS
20 UVWXYZABCDEFGHIJKLMNOPQRST
21 VWXYZABCDEFGHIJKLMNOPQRSTU
22 WXYZABCDEFGHIJKLMNOPQRSTUV
23 XYZABCDEFGHIJKLMNOPQRSTUVW
24 YZABCDEFGHIJKLMNOPQRSTUVWX
25 ZABCDEFGHIJKLMNOPQRSTUVWXY
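The difference is easiest to see by looking at the formatted strings directly. A small sketch; the variable names are chosen here for illustration:

```python
# Width 2, right-aligned: a single digit gets one leading space
row_aligned = "{:2d} {}".format(2, "CDE")

# No minimum width: the number takes exactly as many characters as it needs
row_plain = "{} {}".format(2, "CDE")

# repr() makes the leading space visible
print(repr(row_aligned))
print(repr(row_plain))

# The width is a minimum, so longer numbers are never cut off
print(repr("{:2d}".format(123)))
```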

Is there a way to rank some items in a pandas dataframe and exclude others?

I have a pandas DataFrame called ranks with my clusters and their key metrics. I rank them using rank(); however, there are two specific clusters that I want ranked differently from the others.
ranks = pd.DataFrame(data={'Cluster': ['0', '1', '2',
'3', '4', '5','6', '7', '8', '9'],
'No. Customers': [145118,
2,
1236,
219847,
9837,
64865,
3855,
219549,
34171,
3924120],
'Ave. Recency': [39.0197,
47.0,
15.9716,
41.9736,
23.9330,
24.8281,
26.5647,
17.7493,
23.5205,
24.7933],
'Ave. Frequency': [1.7264,
19.0,
24.9101,
3.0682,
3.2735,
1.8599,
3.9304,
3.3356,
9.1703,
1.1684],
'Ave. Monetary': [14971.85,
237270.00,
126992.79,
17701.64,
172642.35,
13159.21,
54333.56,
17570.67,
42136.68,
4754.76]})
ranks['Ave. Spend'] = ranks['Ave. Monetary']/ranks['Ave. Frequency']
Cluster No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
0 0 145118 39.0197 1.7264 14,971.85 8,672.07
1 1 2 47.0 19.0 237,270.00 12,487.89
2 2 1236 15.9716 24.9101 126,992.79 5,098.02
3 3 219847 41.9736 3.0682 17,701.64 5,769.23
4 4 9837 23.9330 3.2735 172,642.35 52,738.42
5 5 64865 24.8281 1.8599 13,159.21 7,075.19
6 6 3855 26.5647 3.9304 54,333.56 13,823.64
7 7 219549 17.7493 3.3356 17,570.67 5,267.52
8 8 34171 23.5205 9.1703 42,136.68 4,594.89
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21
I then apply the rank() method like this:
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
ranks['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
ranks['overall_rank'] = ranks['overall'].rank(method='first')
Which gives me this:
Cluster No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend|r_rank|f_rank|m_rank|s_rank|overall|overall_rank
0 0 145118 39.0197 1.7264 14,971.85 8,672.07 8 9 8 4 29 9
1 1 2 47.0 19.0 237,270.00 12,487.89 10 2 1 3 16 3
2 2 1236 15.9716 24.9101 126,992.79 5,098.02 1 1 3 8 13 1
3 3 219847 41.9736 3.0682 17,701.64 5,769.23 9 7 6 6 28 7
4 4 9837 23.9330 3.2735 172,642.35 52,738.42 4 6 2 1 13 2
5 5 64865 24.8281 1.8599 13,159.21 7,075.19 6 8 9 5 28 8
6 6 3855 26.5647 3.9304 54,333.56 13,823.64 7 4 4 2 17 4
7 7 219549 17.7493 3.3356 17,570.67 5,267.52 2 5 7 7 21 6
8 8 34171 23.5205 9.1703 42,136.68 4,594.89 3 3 5 9 20 5
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21 5 10 10 10 35 10
This does what it's supposed to do; however, the cluster with the highest Ave. Spend needs to be ranked 1 at all times and the cluster with the highest Ave. Recency needs to be ranked last at all times.
So I modified the code above to look like this:
if(ranks['s_rank'].min() == 1):
    ranks['overall_rank_2'] = 1
elif(ranks['r_rank'].max() == len(ranks)):
    ranks['overall_rank_2'] = len(ranks)
else:
    ranks_2 = ranks.drop(ranks.index[[ranks[ranks['s_rank'] == ranks['s_rank'].min()].index[0],ranks[ranks['r_rank'] == ranks['r_rank'].max()].index[0]]])
    ranks_2['r_rank'] = ranks_2['Ave. Recency'].rank()
    ranks_2['f_rank'] = ranks_2['Ave. Frequency'].rank(ascending=False)
    ranks_2['m_rank'] = ranks_2['Ave. Monetary'].rank(ascending=False)
    ranks_2['s_rank'] = ranks_2['Ave. Spend'].rank(ascending=False)
    ranks_2['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
    ranks['overall_rank_2'] = ranks_2['overall'].rank(method='first')
Then I get this
Cluster No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend|r_rank|f_rank|m_rank|s_rank|overall|overall_rank|overall_rank_2
0 0 145118 39.0197 1.7264 14,971.85 8,672.07 8 9 8 4 29 9 1
1 1 2 47.0 19.0 237,270.00 12,487.89 10 2 1 3 16 3 1
2 2 1236 15.9716 24.9101 126,992.79 5,098.02 1 1 3 8 13 1 1
3 3 219847 41.9736 3.0682 17,701.64 5,769.23 9 7 6 6 28 7 1
4 4 9837 23.9330 3.2735 172,642.35 52,738.42 4 6 2 1 13 2 1
5 5 64865 24.8281 1.8599 13,159.21 7,075.19 6 8 9 5 28 8 1
6 6 3855 26.5647 3.9304 54,333.56 13,823.64 7 4 4 2 17 4 1
7 7 219549 17.7493 3.3356 17,570.67 5,267.52 2 5 7 7 21 6 1
8 8 34171 23.5205 9.1703 42,136.68 4,594.89 3 3 5 9 20 5 1
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21 5 10 10 10 35 10 1
Please help me modify the above if statement or perhaps recommend a different approach altogether. This of course needs to be as dynamic as possible.
So you want a custom ranking on your dataframe, where the cluster(/row) with the highest Ave. Spend is always ranked 1, and the one with the highest Ave. Recency always ranks last.
The solution is five lines. Notes:
You had the right idea with DataFrame.drop(), just use idxmax() to get the index of both of the rows that will need special treatment, and store it, so you don't need a huge unwieldy logical filter expression in your drop.
No need to make so many temporary columns, or the temporary copy ranks_2 = ranks.drop(...); just pass the result of the drop() into a rank() ...
... via a .sum(axis=1) on your desired columns, no need to define a lambda, or save its output in the temp column 'overall'.
...then we just feed those sum-of-ranks into rank(), which will give us values from 1..8, so we add 1 to offset the results of rank() to be 2..9. (You can generalize this).
And we manually set the 'overall_rank' for the Ave. Spend, Ave. Recency rows.
(Yes you could also implement all this as a custom function whose input is the four Ave. columns or else the four *_rank columns.)
Code: (see the bottom for boilerplate to read in your DataFrame; next time, please make your example an MCVE to help us help you)
# Compute raw ranks like you do
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
# Find the indices of both the highest AveSpend and AveRecency
ismax = ranks['Ave. Spend'].idxmax()
irmax = ranks['Ave. Recency'].idxmax()
# Get the overall ranking for every row other than these two; add 1 because rank 1 is reserved for the max-Ave. Spend row:
ranks['overall_rank'] = 1 + ranks.drop(index=[ismax, irmax])[['r_rank', 'f_rank', 'm_rank', 's_rank']].sum(axis=1).rank(method='first')
# Pin the two special rows to the first and last rank (.loc takes a row label and a column name)
ranks.loc[ismax, 'overall_rank'] = 1
ranks.loc[irmax, 'overall_rank'] = len(ranks)
And here's the boilerplate to ingest your data:
import pandas as pd
from io import StringIO
# """Cluster No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
dat = """
0 145118 39.0197 1.7264 14,971.85 8,672.07
1 2 47.0 19.0 237,270.00 12,487.89
2 1236 15.9716 24.9101 126,992.79 5,098.02
3 219847 41.9736 3.0682 17,701.64 5,769.23
4 9837 23.9330 3.2735 172,642.35 52,738.42
5 64865 24.8281 1.8599 13,159.21 7,075.19
6 3855 26.5647 3.9304 54,333.56 13,823.64
7 219549 17.7493 3.3356 17,570.67 5,267.52
8 34171 23.5205 9.1703 42,136.68 4,594.89
9 3924120 24.7933 1.1684 4,754.76 4,069.21 """
# Remove the comma thousands-separator, to prevent your floats being read in as string
dat = dat.replace(',', '')
ranks = pd.read_csv(StringIO(dat), sep=r'\s+', names=
    "Cluster|No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend".split('|'))
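As the answer notes, the same logic can be wrapped in a function. The following is a minimal sketch of that idea, not the answerer's code: the name pinned_overall_rank is hypothetical, the metric values are copied from the question, and it assumes the highest-spend and highest-recency clusters are two different rows.

```python
import pandas as pd


def pinned_overall_rank(ranks):
    """Overall rank with the top-spend row pinned first and the top-recency row last."""
    # Sum of the four individual ranks, computed as in the question
    total = (
        ranks['Ave. Recency'].rank()
        + ranks['Ave. Frequency'].rank(ascending=False)
        + ranks['Ave. Monetary'].rank(ascending=False)
        + ranks['Ave. Spend'].rank(ascending=False)
    )
    first = ranks['Ave. Spend'].idxmax()
    last = ranks['Ave. Recency'].idxmax()
    # Rank the remaining rows 2..n-1, leaving 1 and n for the pinned rows
    overall = 1 + total.drop(index=[first, last]).rank(method='first')
    overall = overall.reindex(ranks.index)
    overall.loc[first] = 1
    overall.loc[last] = len(ranks)
    return overall.astype(int)


ranks = pd.DataFrame({
    'Ave. Recency': [39.0197, 47.0, 15.9716, 41.9736, 23.9330,
                     24.8281, 26.5647, 17.7493, 23.5205, 24.7933],
    'Ave. Frequency': [1.7264, 19.0, 24.9101, 3.0682, 3.2735,
                       1.8599, 3.9304, 3.3356, 9.1703, 1.1684],
    'Ave. Monetary': [14971.85, 237270.00, 126992.79, 17701.64, 172642.35,
                      13159.21, 54333.56, 17570.67, 42136.68, 4754.76],
})
ranks['Ave. Spend'] = ranks['Ave. Monetary'] / ranks['Ave. Frequency']
ranks['overall_rank'] = pinned_overall_rank(ranks)
print(ranks[['Ave. Recency', 'Ave. Spend', 'overall_rank']])
```

Because the pinned rows are found with idxmax() on every call, the function adapts when the data changes, which is the "dynamic" behaviour the question asks for.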

How to place the elements of each column exactly underneath each other? [duplicate]

This question already has answers here:
Create nice column output in python
(22 answers)
Closed 5 years ago.
I have a problem with the output of my code:
the elements of each column are not placed exactly beneath each other.
My original code is too busy, so I reduced it to a simple one;
let me explain the simple one first.
Consider the following simple task:
Write code which receives a natural number r as the number of rows,
and receives another natural number c as the number of columns,
and then prints all natural numbers
from 1 to rc in r rows and c columns.
So the code will be something like the following:
r = int(input("How many Rows? "))     # r is the number of rows
c = int(input("How many columns? "))  # c is the number of columns
for i in range(1, r + 1):
    for j in range(1, c + 1):
        print(j + c * (i - 1), end=' ')
    print()
and the output is as follows:
How many Rows? 5
How many columns? 6
1 2 3 4 5 6
7 8 9 10 11 12
13 14 15 16 17 18
19 20 21 22 23 24
25 26 27 28 29 30
>>>
or:
How many Rows? 7
How many columns? 3
1 2 3
4 5 6
7 8 9
10 11 12
13 14 15
16 17 18
19 20 21
>>>
What should I do, to get an output like this?
How many Rows? 5
How many columns? 6
 1  2  3  4  5  6
 7  8  9 10 11 12
13 14 15 16 17 18
19 20 21 22 23 24
25 26 27 28 29 30
>>>
or
How many Rows? 7
How many columns? 3
 1  2  3
 4  5  6
 7  8  9
10 11 12
13 14 15
16 17 18
19 20 21
>>>
Now my original code is something like the following:
def function(n):
    R = 0
    something...something...something...
    something...something...something...
    something...something...something...
    something...something...something...
    return R
r = int(input("How many Rows? "))     # r is the number of rows
c = int(input("How many columns? "))  # c is the number of columns
for i in range(1, r + 1):
    for j in range(1, c + 1):
        n = j + c * (i - 1)
        value = function(n)
        print(value)
Now for simplicity, suppose that by some by-hand-manipulation we get:
f(1)=function(1)=17, f(2)=235, f(3)=-8;
f(4)=-9641, f(5)=54278249, f(6)=411;
Now when I run the code, the output is as follows:
How many Rows? 2
How many columns? 3
17
235
-8
-9641
54278249
411
>>>
What should I do to get an output like this:
How many Rows? 2
How many columns? 3
   17      235  -8
-9641 54278249 411
>>>
Also note that I did not want to get something like this:
How many Rows? 2
How many columns? 3
17 235 -8
-9641 54278249 411
>>>
Use the rjust method:
r, c = 5, 5
for i in range(1, r + 1):
    for j in range(1, c + 1):
        str_to_printout = str(j + c * (i - 1)).rjust(2)
        print(str_to_printout, end=' ')
    print()
Result:
 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
UPD.
As for your last example, let's say f(n) is defined in this way:
def f(n):
    my_dict = {1:17, 2:235, 3:-8, 4:-9641, 5:54278249, 6:411}
    return my_dict.get(n, 0)
Then you can use the following approach:
r, c = 2, 3
# data table with elements in string format
data_str = [[str(f(j + c * (i - 1))) for j in range(1, c + 1)] for i in range(1, r + 1)]
# transposed data table and list of max len for every column in data_str
data_str_transposed = [list(col) for col in zip(*data_str)]
max_len_columns = [max(map(len, col)) for col in data_str_transposed]
# printing out
# the string " " before 'join' is a delimiter between columns
for row in data_str:
    print(" ".join(elem.rjust(max_len) for elem, max_len in zip(row, max_len_columns)))
Result:
   17      235  -8
-9641 54278249 411
With r,c = 3,3:
   17      235  -8
-9641 54278249 411
    0        0   0
Note that the indent in each column corresponds to the maximum length in this column, and not in the entire table.
Hope this helps. Please comment if you need any further clarifications.
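The per-column approach above can also be packaged as a small reusable helper. This is a sketch only: the function name format_table and its sep parameter are hypothetical, and it uses the format-spec form {cell:>{w}} in place of rjust, which pads in the same way.

```python
def format_table(rows, sep=" "):
    """Right-align every cell to the widest entry in its own column."""
    cells = [[str(value) for value in row] for row in rows]
    # zip(*cells) walks the table column by column
    widths = [max(len(cell) for cell in col) for col in zip(*cells)]
    return [
        sep.join(f"{cell:>{w}}" for cell, w in zip(row, widths))
        for row in cells
    ]


# The six values given in the question for f(1)..f(6)
table = [[17, 235, -8], [-9641, 54278249, 411]]
lines = format_table(table)
for line in lines:
    print(line)
```

Returning the lines instead of printing them directly keeps the helper easy to check and lets the caller decide where the text goes.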
# result stores the final matrix
# max_len stores the length of the longest element
result, max_len = [], 0
for i in range(1, r + 1):
    temp = []
    for j in range(1, c + 1):
        n = j + c * (i - 1)
        value = function(n)
        max_len = max(max_len, len(str(value)))
        temp.append(value)
    result.append(temp)
# print the values separately so rjust() can be applied to each element
for row in result:
    for value in row:
        print(str(value).rjust(max_len), end=' ')
    print()
Adapted from MaximTitarenko's answer:
You first look for the minimum and maximum value, then decide which is the longer one and use its length as the value for the rjust(x) call.
import random
r, c = 15, 5
m = random.sample(range(10000), 100)
length1 = len(str(max(m)))
length2 = len(str(min(m)))
longest = max(length1, length2)
for i in range(r):
    for j in range(c):
        str_to_printout = str(m[i * c + j]).rjust(longest)
        print(str_to_printout, end=' ')
    print()
Example output:
 937 9992 8602 4213 7053
1957 9766 6704 8051 8636
 267  889 1903 8693 5565
8287 7842 6933 2111 9689
3948  428 8894 7522  417
3708 8033  878 4945 2771
6393   35 9065 2193 6797
5430 2720  647 4582 3316
9803 1033 7864  656 4556
6751 6342 4915 5986 6805
9490 2325 5237 8513 8860
8400 1789 2004 4500 2836
8329 4322 6616  132 7198
4715  193 2931 3947 8288
1338 9386 5036 4297 2903
You need to use the string method .rjust.
From the documentation:
string.rjust(s, width[, fillchar])
This function right-justifies a string in a field of given width. It returns a string that is at least width characters wide, created by padding the string on the left with the character fillchar (default is a space) until the given width is reached. The string is never truncated.
So we need to calculate the width (in characters) each number should be padded to. That is pretty simple: the number of digits in the largest value, rows * columns, plus 1 (the +1 adds a one-space gap between columns).
Using this, it becomes quite simple to write the code:
r = int(input("How many Rows? "))
c = int(input("How many columns? "))
width = len(str(r * c)) + 1
for i in range(1, r + 1):
    for j in range(1, c + 1):
        print(str(j + c * (i - 1)).rjust(width), end=' ')
    print()
which for an r, c of 4, 5 respectively, outputs:
  1   2   3   4   5
  6   7   8   9  10
 11  12  13  14  15
 16  17  18  19  20
Hopefully this helps you out and you can adapt this to other situations yourself!
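To adapt this to other situations, it helps to see how rjust behaves on its own and then build each row as a single string. A minimal sketch in Python 3; the helper name grid_lines is hypothetical, and unlike the code above it uses the digit count of r*c as the width and lets join supply the single space between columns.

```python
# rjust pads on the left, accepts an optional fill character, and never truncates
print(repr("abc".rjust(5)))
print(repr("42".rjust(5, "0")))
print(repr("abcdef".rjust(3)))


def grid_lines(r, c):
    """Return the numbers 1..r*c as r right-aligned rows of c columns."""
    width = len(str(r * c))  # digits in the largest number
    return [
        " ".join(str(j + c * (i - 1)).rjust(width) for j in range(1, c + 1))
        for i in range(1, r + 1)
    ]


lines = grid_lines(4, 5)
for line in lines:
    print(line)
```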
