Convert list to dataframe - python

I am running a loop that appends three fields on each iteration. predictfinal is a list, though it does not have to stay one.
predictfinal.append(y_hat_orig[0])
predictfinal.append(mape)
predictfinal.append(length)
At the end, predictfinal is one long flat list. What I really want is to conform the list into a DataFrame where each row has 3 columns, but the list does not designate the 3 columns; it is just a long run of values with commas in between. How can I slice predictfinal into 3 columns and build a DataFrame from the current unstructured list?
predictfinal
Out[88]:
[1433.0459967608983,
1.6407741379111223,
23,
1433.6389125340916,
1.6474721044455922,
22,
1433.867408791692,
1.6756763089082383,
21,
1433.8484984008207,
1.6457581105556003,
20,
1433.6340460965778,
1.6380908467895527,
19,
1437.0294365907992,
1.6147672264908473,
18,
1439.7485102740507,
1.5010415925555876,
17,
1440.950406295299,
1.433891246672529,
16,
1434.837060644701,
1.5252803314930383,
15,
1434.9716303636983,
1.6125952442799232,
14,
1441.3153523102953,
3.2633984339696185,
13,
1435.6932462859334,
3.2703435261200497,
12,
1419.9057834496082,
1.9100005818319687,
11,
1426.0739741342488,
1.947684057178654,
10]

Based on https://stackoverflow.com/a/48347320/6926444, we can achieve this by using zip() and iter(). The code below consumes three elements at a time:
res = pd.DataFrame(list(zip(*([iter(predictfinal)] * 3))), columns=['a', 'b', 'c'])
Result:
a b c
0 1433.045997 1.640774 23
1 1433.638913 1.647472 22
2 1433.867409 1.675676 21
3 1433.848498 1.645758 20
4 1433.634046 1.638091 19
5 1437.029437 1.614767 18
6 1439.748510 1.501042 17
7 1440.950406 1.433891 16
8 1434.837061 1.525280 15
9 1434.971630 1.612595 14
10 1441.315352 3.263398 13
11 1435.693246 3.270344 12
12 1419.905783 1.910001 11
13 1426.073974 1.947684 10
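To see why [iter(predictfinal)] * 3 groups the flat list into triples: the list repeats the same iterator three times, and zip() advances that one iterator round-robin. A minimal illustration:
it = iter([1, 2, 3, 4, 5, 6])
print(list(zip(it, it, it)))  # [(1, 2, 3), (4, 5, 6)]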

You could do:
pd.DataFrame(np.array(predictfinal).reshape(-1,3), columns=['origin', 'mape', 'length'])
Output:
origin mape length
0 1433.045997 1.640774 23.0
1 1433.638913 1.647472 22.0
2 1433.867409 1.675676 21.0
3 1433.848498 1.645758 20.0
4 1433.634046 1.638091 19.0
5 1437.029437 1.614767 18.0
6 1439.748510 1.501042 17.0
7 1440.950406 1.433891 16.0
8 1434.837061 1.525280 15.0
9 1434.971630 1.612595 14.0
10 1441.315352 3.263398 13.0
11 1435.693246 3.270344 12.0
12 1419.905783 1.910001 11.0
13 1426.073974 1.947684 10.0
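Note that np.array(predictfinal) upcasts every value to float, which is why length shows as 23.0 rather than 23. If the integer dtype matters, you can cast that column back afterwards:
df = pd.DataFrame(np.array(predictfinal).reshape(-1, 3), columns=['origin', 'mape', 'length'])
df['length'] = df['length'].astype(int)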
Or you can also modify your loop:
predictfinal = []
for i in some_list:
    predictfinal.append([y_hat_orig[0], mape, length])

# output dataframe
pd.DataFrame(predictfinal, columns=['origin', 'mape', 'length'])

Here is a pandas solution using a MultiIndex and unstack() (with l being your flat list):
s = pd.Series(l)
s.index = pd.MultiIndex.from_product([range(len(l) // 3), ['origin', 'map', 'len']])
s = s.unstack()
Out[268]:
len map origin
0 23.0 1.640774 1433.045997
1 22.0 1.647472 1433.638913
2 21.0 1.675676 1433.867409
3 20.0 1.645758 1433.848498
4 19.0 1.638091 1433.634046
5 18.0 1.614767 1437.029437
6 17.0 1.501042 1439.748510
7 16.0 1.433891 1440.950406
8 15.0 1.525280 1434.837061
9 14.0 1.612595 1434.971630
10 13.0 3.263398 1441.315352
11 12.0 3.270344 1435.693246
12 11.0 1.910001 1419.905783
13 10.0 1.947684 1426.073974
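Depending on your pandas version, the unstacked columns may not come out in the order you listed (here they display as len, map, origin). A plain column selection restores the intended order:
s = s[['origin', 'map', 'len']]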

Related

delete redundant rows in a dataframe with set in columns

I have a dataframe df:
Cluster OsId BrowserId PageId VolumePred ConversionPred
0 11 11 {789615, 955761, 1149586, 955764, 955767, 1187... 147.0 71.0
1 0 11 12 {1184903, 955761, 1149586, 1158132, 955764, 10... 73.0 38.0
2 0 11 15 {1184903, 1109643, 955761, 955764, 1074581, 95... 72.0 40.0
3 0 11 16 {1123200, 1184903, 1109643, 1018637, 1005581, ... 7815.0 5077.0
4 0 11 17 {1184903, 789615, 1016529, 955761, 955764, 955... 52.0 47.0
... ... ... ... ... ... ...
307 {0, 4, 7, 9, 12, 15, 18, 21} 99 16 1154705 220.0 182.0
308 {18} 99 16 1155314 12.0 6.0
309 {9} 99 16 1158132 4.0 4.0
310 {0, 4, 7, 9, 12, 15, 18, 21} 99 16 1184903 966.0 539.0
This dataframe contains redundant rows that I need to delete, so I tried this:
df.drop_duplicates()
But I got this error: TypeError: unhashable type: 'set'
Any idea how to fix this error? Thanks!
Use frozensets to avoid the unhashable set type: test duplicates with DataFrame.duplicated and filter with boolean indexing, inverting the mask with ~:
#sets are in any column
df1 = df.applymap(lambda x: frozenset(x) if isinstance(x, set) else x)
df[~df1.duplicated()]
If no row is removed, it means no row has duplicates (all columns are tested together).
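If you know in advance which columns hold sets (here Cluster and PageId), a smaller sketch converts just those columns in place; note this leaves frozensets in the result, unlike the mask approach above, which preserves the original sets:
for col in ['Cluster', 'PageId']:
    df[col] = df[col].map(lambda x: frozenset(x) if isinstance(x, set) else x)
df = df.drop_duplicates()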

Iterate through df row and append to a list w/o name and dtype

I have a df that has 24 cols, and I want to iterate through each row and append its values to a list consecutively.
The code below does the job - but it also appends the index value, the Name, and the dtype, which I need to remove.
results = []
for row in data.iterrows():
    results.append(row)
(0, 1 11.87
2 7.60
3 0.32
4 3.11
5 47.43
6 47.81
7 24.74
8 32.57
9 39.49
10 24.48
11 18.14
12 26.52
13 14.17
14 13.45
15 17.80
16 17.89
17 27.39
18 51.55
19 60.22
20 69.64
21 75.97
22 67.45
23 52.88
24 53.25
Name: 0, dtype: float64)
(1, 1 54.49
2 51.67
3 53.68
4 33.81
5 26.99
6 25.80
7 36.35
8 28.85
9 26.01
10 8.44
11 1.64
12 8.01
13 23.41
14 16.22
15 16.30
16 8.90
17 1.93
18 0.00
19 2.79
20 30.24
21 55.58
22 62.79
23 74.70
24 68.46
Name: 1, dtype: float64)
In effect I want to iterate through each row, transpose it, and append its values onto one list consecutively. If a df is (5, 24), then the length of the list will be 5 * 24 = 120.
You don't need to iterate through them. Try this:
inp = [{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}]
df = pd.DataFrame(inp)
print(df)
c1 c2
0 10 100
1 11 110
2 12 120
Now you can use .values.ravel() to create a list of all dataframe values:
list(df.values.ravel())
Output:
[10, 100, 11, 110, 12, 120]
As your question is phrased, you likely want output as a tuple/list with the corresponding values for each row; the output you are asking for is not a flattened list.
pandas has good functions for interoperating with numpy, and numpy is a great module for working with arrays/lists.
Say your DataFrame is called data: data.to_numpy() outputs a nested array with the values for each row.
output:
[['joe' 'Doe' 34]
['bob' 'Warren' 20]
['Anna' 'Anderson' 10]]
You can index the array like data.to_numpy()[0].
You can also flatten it with data.to_numpy().flatten().
Output with .flatten():
['joe' 'Doe' 34 'bob' 'Warren' 20 'Anna' 'Anderson' 10]
You can use a for loop too:
for i in data.to_numpy():
    print(i)
This gives you every inner list in the nested array.
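Putting that together, a runnable sketch (the three-row frame is hypothetical, built to match the example output above):
import pandas as pd

data = pd.DataFrame({'first': ['joe', 'bob', 'Anna'],
                     'last': ['Doe', 'Warren', 'Anderson'],
                     'age': [34, 20, 10]})

print(data.to_numpy())                  # nested array, one inner list per row
print(data.to_numpy()[0])               # ['joe' 'Doe' 34]
print(list(data.to_numpy().flatten()))  # ['joe', 'Doe', 34, 'bob', 'Warren', 20, 'Anna', 'Anderson', 10]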

extracting upper and lower row if a condition is met

I have the following coordinate dataframe, divided into blocks. Each block starts at seq0_leftend, seq0_rightend, seq1_leftend, seq1_rightend, seq2_leftend, seq2_rightend, seq3_leftend, seq3_rightend, and so on. For each block, whenever a row's coordinates are negative, I would like to extract that row together with the row above and the row below it. An example of my dataframe file:
seq0_leftend seq0_rightend
0 7 107088
1 107089 108940
2 108941 362759
3 362760 500485
4 500486 509260
5 509261 702736
seq1_leftend seq1_rightend
0 1 106766
1 106767 108619
2 108620 355933
3 355934 488418
4 488419 497151
5 497152 690112
6 690113 700692
7 700693 721993
8 721994 722347
9 722348 946296
10 946297 977714
11 977715 985708
12 -985709 -990725
13 991992 1042023
14 1042024 1259523
15 1259524 1261239
seq2_leftend seq2_rightend
0 1 109407
1 362514 364315
2 109408 362513
3 364450 504968
4 -504969 -515995
5 515996 671291
6 -671295 -682263
7 682264 707010
8 -707011 -709780
9 709781 934501
10 973791 1015417
11 -961703 -973790
12 948955 961702
13 1015418 1069976
14 1069977 1300633
15 -1300634 -1301616
16 1301617 1344821
17 -1515463 -1596433
18 1514459 1515462
19 -1508094 -1514458
20 1346999 1361467
21 -1361468 -1367472
22 1369840 1508093
seq3_leftend seq3_rightend
0 1 112030
1 112031 113882
2 113883 381662
3 381663 519575
4 519576 528317
5 528318 724500
6 724501 735077
7 735078 759456
8 759457 763157
9 763158 996929
10 996931 1034492
11 1034493 1040984
12 -1040985 -1061402
13 1071212 1125426
14 1125427 1353901
15 1353902 1356209
16 1356210 1392818
seq4_leftend seq4_rightend
0 1 105722
1 105723 107575
2 107576 355193
3 355194 487487
4 487488 496220
5 496221 689560
6 689561 700139
7 700140 721438
8 721458 721497
9 721498 947183
10 947184 978601
11 978602 986595
12 -986596 -991612
13 994605 1046245
14 1046247 1264692
15 1264693 1266814
Finally, I want to write a new csv with the data of interest; an example of the final result I would like is this:
seq1_leftend seq1_rightend
11 977715 985708
12 -985709 -990725
13 991992 1042023
seq2_leftend seq2_rightend
3 364450 504968
4 -504969 -515995
5 515996 671291
6 -671295 -682263
7 682264 707010
8 -707011 -709780
9 709781 934501
10 973791 1015417
11 -961703 -973790
12 948955 961702
14 1069977 1300633
15 -1300634 -1301616
16 1301617 1344821
17 -1515463 -1596433
18 1514459 1515462
19 -1508094 -1514458
20 1346999 1361467
21 -1361468 -1367472
22 1369840 1508093
seq3_leftend seq3_rightend
11 1034493 1040984
12 -1040985 -1061402
13 1071212 1125426
seq4_leftend seq4_rightend
11 978602 986595
12 -986596 -991612
13 994605 1046245
I assume that you have a list of DataFrames, let's call it src.
To convert a single DataFrame, define the following function:
def findRows(df):
    col = df.iloc[:, 0]
    if col.lt(0).any():
        return df[col.lt(0) | col.shift(1).lt(0) | col.shift(-1).lt(0)]
    else:
        return None
Note that this function starts with reading column 0 from the source
DataFrame, so it is independent of the name of this column.
Then it checks whether any element in this column is < 0.
If found, the returned object is a DataFrame with the rows that contain a value < 0 either in the current row, in the previous row, or in the next row.
If not found, this function returns None (from your expected result I see that in such a case you don't want even an empty DataFrame).
The first stage is to collect results of this function called on each
DataFrame from src:
result = [ findRows(df) for df in src ]
And the last step is to filter out elements which are None:
result = list(filter(None.__ne__, result))
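If the None.__ne__ idiom looks cryptic, an equivalent list comprehension reads more plainly:
result = [df for df in result if df is not None]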
To see the result, run:
for df in result:
    print(df)
For src containing first 3 of your DataFrames, I got:
seq1_leftend seq1_rightend
11 977715 985708
12 -985709 -990725
13 991992 1042023
seq2_leftend seq2_rightend
3 364450 504968
4 -504969 -515995
5 515996 671291
6 -671295 -682263
7 682264 707010
8 -707011 -709780
9 709781 934501
10 973791 1015417
11 -961703 -973790
12 948955 961702
14 1069977 1300633
15 -1300634 -1301616
16 1301617 1344821
17 -1515463 -1596433
18 1514459 1515462
19 -1508094 -1514458
20 1346999 1361467
21 -1361468 -1367472
22 1369840 1508093
As you can see, the resulting list contains only results
originating from the second and third source DataFrame.
The first was filtered out, since findRows returned
None from its processing.
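To produce the final csv from your question, you can append each surviving block to a single file; a minimal sketch, assuming the per-block headers should be repeated as in your expected output:
with open('result.csv', 'w') as f:
    for df in result:
        df.to_csv(f)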

Is there a way to rank some items in a pandas dataframe and exclude others?

I have a pandas dataframe called ranks with my clusters and their key metrics. I rank them using rank(); however, there are two specific clusters which I want ranked differently from the others.
ranks = pd.DataFrame(data={
    'Cluster': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'],
    'No. Customers': [145118, 2, 1236, 219847, 9837, 64865, 3855, 219549, 34171, 3924120],
    'Ave. Recency': [39.0197, 47.0, 15.9716, 41.9736, 23.9330, 24.8281, 26.5647, 17.7493, 23.5205, 24.7933],
    'Ave. Frequency': [1.7264, 19.0, 24.9101, 3.0682, 3.2735, 1.8599, 3.9304, 3.3356, 9.1703, 1.1684],
    'Ave. Monetary': [14971.85, 237270.00, 126992.79, 17701.64, 172642.35, 13159.21, 54333.56, 17570.67, 42136.68, 4754.76]})
ranks['Ave. Spend'] = ranks['Ave. Monetary']/ranks['Ave. Frequency']
Cluster No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
0 0 145118 39.0197 1.7264 14,971.85 8,672.07
1 1 2 47.0 19.0 237,270.00 12,487.89
2 2 1236 15.9716 24.9101 126,992.79 5,098.02
3 3 219847 41.9736 3.0682 17,701.64 5,769.23
4 4 9837 23.9330 3.2735 172,642.35 52,738.42
5 5 64865 24.8281 1.8599 13,159.21 7,075.19
6 6 3855 26.5647 3.9304 54,333.56 13,823.64
7 7 219549 17.7493 3.3356 17,570.67 5,267.52
8 8 34171 23.5205 9.1703 42,136.68 4,594.89
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21
I then apply the rank() method like this:
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
ranks['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
ranks['overall_rank'] = ranks['overall'].rank(method='first')
Which gives me this:
Cluster No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend|r_rank|f_rank|m_rank|s_rank|overall|overall_rank
0 0 145118 39.0197 1.7264 14,971.85 8,672.07 8 9 8 4 29 9
1 1 2 47.0 19.0 237,270.00 12,487.89 10 2 1 3 16 3
2 2 1236 15.9716 24.9101 126,992.79 5,098.02 1 1 3 8 13 1
3 3 219847 41.9736 3.0682 17,701.64 5,769.23 9 7 6 6 28 7
4 4 9837 23.9330 3.2735 172,642.35 52,738.42 4 6 2 1 13 2
5 5 64865 24.8281 1.8599 13,159.21 7,075.19 6 8 9 5 28 8
6 6 3855 26.5647 3.9304 54,333.56 13,823.64 7 4 4 2 17 4
7 7 219549 17.7493 3.3356 17,570.67 5,267.52 2 5 7 7 21 6
8 8 34171 23.5205 9.1703 42,136.68 4,594.89 3 3 5 9 20 5
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21 5 10 10 10 35 10
This does what it's supposed to do; however, the cluster with the highest Ave. Spend needs to be ranked 1 at all times, and the cluster with the highest Ave. Recency needs to be ranked last at all times.
So I modified the code above to look like this:
if ranks['s_rank'].min() == 1:
    ranks['overall_rank_2'] = 1
elif ranks['r_rank'].max() == len(ranks):
    ranks['overall_rank_2'] = len(ranks)
else:
    ranks_2 = ranks.drop(ranks.index[[ranks[ranks['s_rank'] == ranks['s_rank'].min()].index[0],
                                      ranks[ranks['r_rank'] == ranks['r_rank'].max()].index[0]]])
    ranks_2['r_rank'] = ranks_2['Ave. Recency'].rank()
    ranks_2['f_rank'] = ranks_2['Ave. Frequency'].rank(ascending=False)
    ranks_2['m_rank'] = ranks_2['Ave. Monetary'].rank(ascending=False)
    ranks_2['s_rank'] = ranks_2['Ave. Spend'].rank(ascending=False)
    ranks_2['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
    ranks['overall_rank_2'] = ranks_2['overall'].rank(method='first')
Then I get this
Cluster No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend|r_rank|f_rank|m_rank|s_rank|overall|overall_rank|overall_rank_2
0 0 145118 39.0197 1.7264 14,971.85 8,672.07 8 9 8 4 29 9 1
1 1 2 47.0 19.0 237,270.00 12,487.89 10 2 1 3 16 3 1
2 2 1236 15.9716 24.9101 126,992.79 5,098.02 1 1 3 8 13 1 1
3 3 219847 41.9736 3.0682 17,701.64 5,769.23 9 7 6 6 28 7 1
4 4 9837 23.9330 3.2735 172,642.35 52,738.42 4 6 2 1 13 2 1
5 5 64865 24.8281 1.8599 13,159.21 7,075.19 6 8 9 5 28 8 1
6 6 3855 26.5647 3.9304 54,333.56 13,823.64 7 4 4 2 17 4 1
7 7 219549 17.7493 3.3356 17,570.67 5,267.52 2 5 7 7 21 6 1
8 8 34171 23.5205 9.1703 42,136.68 4,594.89 3 3 5 9 20 5 1
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21 5 10 10 10 35 10 1
Please help me modify the above if statement, or perhaps recommend a different approach altogether. This of course needs to be as dynamic as possible.
So you want a custom ranking on your dataframe, where the cluster(/row) with the highest Ave. Spend is always ranked 1, and the one with the highest Ave. Recency always ranks last.
The solution is five lines. Notes:
You had the right idea with DataFrame.drop(); just use idxmax() to get the indices of the two rows that need special treatment, and store them, so you don't need a huge unwieldy logical filter expression in your drop.
No need to make so many temporary columns, or the temporary copy ranks_2 = ranks.drop(...); just pass the result of the drop() into a rank() ...
... via a .sum(axis=1) on your desired columns; no need to define a lambda or save its output in the temporary column 'overall'.
... then we feed those sums of ranks into rank(), which gives us values from 1..8, so we add 1 to offset the results of rank() to be 2..9. (You can generalize this.)
And we manually set the 'overall_rank' for the Ave. Spend, Ave. Recency rows.
(Yes you could also implement all this as a custom function whose input is the four Ave. columns or else the four *_rank columns.)
Code (see the bottom for the boilerplate to read in your dataframe; next time please make your example an MCVE, to help us help you):
# Compute raw ranks like you do
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
# Find the indices of both the highest AveSpend and AveRecency
ismax = ranks['Ave. Spend'].idxmax()
irmax = ranks['Ave. Recency'].idxmax()
# Get the overall ranking for every row other than these... add 1 to offset for excluding the max-AveSpend row:
ranks['overall_rank'] = 1 + ranks.drop(index=[ismax, irmax])[['r_rank', 'f_rank', 'm_rank', 's_rank']].sum(axis=1).rank(method='first')
# (Note: in .loc[], can't mix indices (ismax) with column-names)
ranks.loc[ ranks['Ave. Spend'].idxmax(), 'overall_rank' ] = 1
ranks.loc[ ranks['Ave. Recency'].idxmax(), 'overall_rank' ] = len(ranks)
And here's the boilerplate to ingest your data:
import pandas as pd
from io import StringIO
# """Cluster No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
dat = """
0 145118 39.0197 1.7264 14,971.85 8,672.07
1 2 47.0 19.0 237,270.00 12,487.89
2 1236 15.9716 24.9101 126,992.79 5,098.02
3 219847 41.9736 3.0682 17,701.64 5,769.23
4 9837 23.9330 3.2735 172,642.35 52,738.42
5 64865 24.8281 1.8599 13,159.21 7,075.19
6 3855 26.5647 3.9304 54,333.56 13,823.64
7 219549 17.7493 3.3356 17,570.67 5,267.52
8 34171 23.5205 9.1703 42,136.68 4,594.89
9 3924120 24.7933 1.1684 4,754.76 4,069.21 """
# Remove the comma thousands-separator, to prevent your floats being read in as string
dat = dat.replace(',', '')
ranks = pd.read_csv(StringIO(dat), sep='\s+', names=
"Cluster|No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend".split('|'))

How do I separate arrays and add them based on their index in the array?

I am trying to make a wage calculator where a user inserts a .txt file and the program calculates the number of hours worked.
So far I am able to separate the names, wage value, and hours, but I can't figure out how to add the hours together.
So my desired result would be:
Names of Employees
Wage (how much they make)
Added number of hours per employee
Here is the data set (the txt file is named empwages.txt):
(Edit: the formatting was messed up, so here is the text:)
Spencer 12.75 8 8 8 8 10
Ruiz 18 8 8 9.5 8 8
Weiss 14.80 7 5 8 8 10
Choi 15 4 7 5 3.3 2.2
Miller 18 6.5 9 1 4 1
Barnes 15 7.5 9 4 0 2
Desired Outcome:
'Spencer', 'Ruiz', 'Weiss', 'Choi', 'Miller', 'Barnes'
'12.75', '18', '14.80', '15', '18', '15'
'42', '41.5', ... and so on
Current code:
infile = open("empwages.txt","r")
masterList = infile.readlines()
nameList = []
hourList = []
plushourList = []
for master in masterList:
nameList.append(master.split()[0])
hourList.append(master.split()[1])
x = 2
while x <= 6:
plushourList.append(master.split()[x])
x += 1
print(nameList)
print(hourList)
print(plushourList)
It is useful to get familiar with the concept of iterable unpacking in Python. You can use the following code to solve your problem:
names = []
hours = []
more_hours = []

with open('empwages.txt') as f:
    for line in f:
        name, hour, *more_hs = line.split()
        names.append(name)
        hours.append(hour)
        more_hours.append(more_hs)

print(*names, sep=', ')
print(*hours, sep=', ')
print(*[sum(float(q) for q in e) for e in more_hours])
In case you need the joined strings as you have requested:
names = []
hours = []
more_hours = []

with open('empwages.txt') as f:
    for line in f:
        name, hour, *more_hs = line.split()
        names.append(name)
        hours.append(hour)
        more_hours.append(more_hs)

names = ', '.join(names)
hours = ', '.join(hours)
more_hours = ', '.join(str(s) for s in [sum(float(q) for q in e) for e in more_hours])
print(names)
print(hours)
print(more_hours)
Output
Spencer, Ruiz, Weiss, Choi, Miller, Barnes
12.75, 18, 14.80, 15, 18, 15
42.0 41.5 38.0 21.5 21.5 22.5
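The key line is the starred assignment name, hour, *more_hs = line.split(); a minimal illustration of extended unpacking:
first, second, *rest = 'Spencer 12.75 8 8 8 8 10'.split()
# first == 'Spencer', second == '12.75', rest == ['8', '8', '8', '8', '10']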
Try using zip:
with open("empwages.txt") as f:
    lines = [line.split() for line in f]

names, hours, *more_hours = zip(*lines)
print(names)
print(hours)
print([sum(map(float, i)) for i in zip(*more_hours)])
('Spencer', 'Ruiz', 'Weiss', 'Choi', 'Miller', 'Barnes')
('12.75', '18', '14.80', '15', '18', '15')
[42.0, 41.5, 38.0, 21.5, 21.5, 22.5]
This will:
Split the file up by line, and split the lines up by word
Put the first word of each line in names, the second in hours, and the rest in more_hours
You can add more variables before the *more_hours as needed.
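The trick is that zip(*lines) transposes the row-wise data into per-column tuples; a tiny illustration:
rows = [['a', 1], ['b', 2], ['c', 3]]
print(list(zip(*rows)))  # [('a', 'b', 'c'), (1, 2, 3)]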
Well if you're not opposed to using pandas:
import pandas as pd
from io import StringIO
import re
initial_data = '''Spencer 12.75 8 8 8 8 10
Ruiz 18 8 8 9.5 8 8
Weiss 14.80 7 5 8 8 10
Choi 15 4 7 5 3.3 2.2
Miller 18 6.5 9 1 4 1
Barnes 15 7.5 9 4 0 2'''
df = pd.read_csv(StringIO(re.sub(r'[ ]+', ',', initial_data, flags=re.M)), header=None)
print(df)
0 1 2 3 4 5 6
0 Spencer 12.75 8.0 8 8.0 8.0 10.0
1 Ruiz 18.00 8.0 8 9.5 8.0 8.0
2 Weiss 14.80 7.0 5 8.0 8.0 10.0
3 Choi 15.00 4.0 7 5.0 3.3 2.2
4 Miller 18.00 6.5 9 1.0 4.0 1.0
5 Barnes 15.00 7.5 9 4.0 0.0 2.0
Then you can sum the hour columns (skipping the wage in column 1) like so:
df.loc[:, 2:].sum(axis=1)
0 42.0
1 41.5
2 38.0
3 21.5
4 21.5
5 22.5
dtype: float64
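From there the three requested outputs fall out directly (a sketch under the column layout above, where column 0 holds the names and column 1 the wage):
print(list(df[0]))                      # names
print(list(df[1]))                      # wages
print(list(df.loc[:, 2:].sum(axis=1)))  # summed hours per employee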
