Pandas issue iterating over DataFrame - python

I'm using pandas to scrape a web page and iterate through a DataFrame object. Here's the function I'm calling:
def getTeamRoster(teamURL):
    teamPlayers = []
    table = pd.read_html(requests.get(teamURL).content)[4]
    nameTitle = '\n\t\t\t\tPlayers\n\t\t\t'
    ratingTitle = 'SinglesRating'
    finalTable = table[[nameTitle, ratingTitle]][:-1]
    print(finalTable)
    for index, row in finalTable:
        print(index, row)
I'm using the syntax advocated here:
http://www.swegler.com/becky/blog/2014/08/06/useful-pandas-snippets/
However, I'm getting this error:
File "SquashScraper.py", line 46, in getTeamRoster
for index, row in finalTable:
ValueError: too many values to unpack (expected 2)
For what it's worth, my finalTable prints as this:
\n\t\t\t\tPlayers\n\t\t\t SinglesRating
0 Browne,Noah 5.56
1 Ellis,Thornton 4.27
2 Line,James 4.25
3 Desantis,Scott J. 5.08
4 Bahadori,Cameron 4.97
5 Groot,Michael 4.76
6 Ehsani,Darian 4.76
7 Kardon,Max 4.83
8 Van,Jeremy 4.66
9 Southmayd,Alexander T. 4.91
10 Cacouris,Stephen A 4.68
11 Groot,Christopher 4.62
12 Mack,Peter D. (sub) 3.94
13 Shrager,Nathaniel O. 0.00
14 Woolverton,Peter C. 4.06
which looks right to me.
Any idea why python doesn't like my syntax?
Thanks for the help,
bclayman

You probably want to try this:
for index, row in finalTable.iterrows():
    print(index, row)
Iterating over a DataFrame directly yields its column labels, so your loop was trying to unpack each label string into two variables, which raises the ValueError; iterrows() yields the (index, row) pairs you expected.
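As a side note, itertuples() is usually faster than iterrows() if you just need the values. A minimal sketch, assuming you rename the awkward scraped headers first (the names Player and SinglesRating here are illustrative):
# Rename columns so itertuples() can expose them as attributes
finalTable.columns = ['Player', 'SinglesRating']
# Each row is a namedtuple; Index is the DataFrame index
for row in finalTable.itertuples():
    print(row.Index, row.Player, row.SinglesRating)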

Related

multiplying column of the file by exponential function

I'm struggling with multiplying one column of a file by an exponential function, so my equation is
y = 10.43^(-x/3.0678) + 0.654
The first values in the column are my x values in the equation. So far I have only been able to multiply by scalars; I couldn't apply the exponential function.
The file looks like this:
8.7
8.09
5.7
5.1713
4.74
4.41
4.14
3.29
3.16
2.85
2.52
2.25
2.027
1.7
1.509
0.76
0.3
0.1
So after the calculation, my y values should look like this (x in the first column, y in the second):
8.7 0.655294908
8.09 0.656064021
5.7 0.6668238549
5.1713 0.6732091509
4.74 0.6807096436
4.41 0.6883719253
4.14 0.6962497391
3.29 0.734902438
3.16 0.7433536016
2.85 0.7672424605
2.52 0.7997286905
2.25 0.8331287249
2.027 0.8664148415
1.7 0.926724933
1.509 0.9695896976
0.76 1.213417197
0.3 1.449100509
0.1 1.580418766
So far this code works for me, but it is far from what I want:
from scipy.optimize import minimize_scalar  # currently unused
import math  # currently unused
import pandas as pd

col_list = ["Position"]
df = pd.read_csv("force.dat", usecols=col_list)
print(df)
A = df["Position"]
X = -A / 3.0678 + 0.654
print(X)
If I understand it correctly, you just want to apply a function to a column of a pandas DataFrame, right? If so, you can define the function:
def foo(x):
    y = 10.43 ** (-x / 3.0678) + 0.654
    return y
and apply it to df in a new column. If x is the column with the x values, then y will be
df['y'] = df['x'].apply(foo)
Now print(df) should give you the example result in your question.
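Putting the pieces together with the file from your snippet (a minimal sketch; "force.dat" and the "Position" column name are taken from the question):
import pandas as pd
# Read only the column holding the x values
df = pd.read_csv("force.dat", usecols=["Position"])
# Vectorized form of foo(): computed over the whole column at once
df["y"] = 10.43 ** (-df["Position"] / 3.0678) + 0.654
print(df)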
You can do it in one line:
>>> df['y'] = 10.43 ** (- df['x']/3.0678)+0.654
>>> print(df)
x y
0 8.0900 0.656064
1 5.7000 0.666824
2 5.1713 0.673209
3 4.7400 0.680710
4 4.4100 0.688372
5 4.1400 0.696250
6 3.2900 0.734902
7 3.1600 0.743354
8 2.8500 0.767242
9 2.5200 0.799729
10 2.2500 0.833129
11 2.0270 0.866415
12 1.7000 0.926725
13 1.5090 0.969590
14 0.7600 1.213417
15 0.3000 1.449101
16 0.1000 1.580419

What is the best way to populate a column of a dataframe with conditional values based on corresponding rows in another column?

I have a dataframe, df, in which I am attempting to fill in values within the empty 'Set' column, depending on a condition. The condition is as follows: the value of the 'Set' column needs to be 'IN' whenever the 'valence_median_split' column's value is 'Low_Valence' in the corresponding row, and 'OUT' in all other cases.
Please see below for an example of my attempt to solve this:
df.head()
Out[65]:
ID Category Num Vert_Horizon Description Fem_Valence_Mean \
0 Animals_001_h Animals 1 h Dead Stork 2.40
1 Animals_002_v Animals 2 v Lion 6.31
2 Animals_003_h Animals 3 h Snake 5.14
3 Animals_004_v Animals 4 v Wolf 4.55
4 Animals_005_h Animals 5 h Bat 5.29
Fem_Valence_SD Fem_Av/Ap_Mean Fem_Av/Ap_SD Arousal_Mean ... Contrast \
0 1.30 3.03 1.47 6.72 ... 68.45
1 2.19 5.96 2.24 6.69 ... 32.34
2 1.19 5.14 1.75 5.34 ... 59.92
3 1.87 4.82 2.27 6.84 ... 75.10
4 1.56 4.61 1.81 5.50 ... 59.77
JPEG_size80 LABL LABA LABB Entropy Classification \
0 263028 51.75 -0.39 16.93 7.86
1 250208 52.39 10.63 30.30 6.71
2 190887 55.45 0.25 4.41 7.83
3 282350 49.84 3.82 1.36 7.69
4 329325 54.26 -0.34 -0.95 7.82
valence_median_split temp_selection set
0 Low_Valence Animals_001_h
1 High_Valence NaN
2 Low_Valence Animals_003_h
3 Low_Valence Animals_004_v
4 Low_Valence Animals_005_h
[5 rows x 36 columns]
df['set'] = np.where(df.loc[df['valence_median_split'] == 'Low_Valence'], 'IN', 'OUT')
ValueError: Length of values does not match length of index
I can accomplish this by using .loc to split df into two separate DataFrames, but I am wondering if there is a more elegant solution using np.where or a similar approach.
Change to
df['set'] = np.where(df['valence_median_split'] == 'Low_Valence', 'IN', 'OUT')
np.where needs a boolean mask the same length as df; df.loc[df['valence_median_split'] == 'Low_Valence'] returns only the filtered rows, which is why you got the length mismatch.
If you need .loc:
df.loc[df['valence_median_split'] == 'Low_Valence', 'set'] = 'IN'
df.loc[df['valence_median_split'] != 'Low_Valence', 'set'] = 'OUT'
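For reference, a minimal runnable sketch of the np.where approach, built from a few rows of the df.head() output above:
import numpy as np
import pandas as pd
# Tiny reproduction with values from the question
df = pd.DataFrame({
    "ID": ["Animals_001_h", "Animals_002_v", "Animals_003_h"],
    "valence_median_split": ["Low_Valence", "High_Valence", "Low_Valence"],
})
# Boolean mask -> elementwise choice of 'IN'/'OUT', same length as df
df["set"] = np.where(df["valence_median_split"] == "Low_Valence", "IN", "OUT")
print(df)  # Low_Valence rows get IN, the rest OUT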

pandas: getting the most frequent names from a column which holds lists of names

my dataframe is like this
star_rating actors_list
0 9.3 [u'Tim Robbins', u'Morgan Freeman']
1 9.2 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 [u'Al Pacino', u'Robert De Niro']
3 9.0 [u'Christian Bale', u'Heath Ledger']
4 8.9 [u'John Travolta', u'Uma Thurman']
I want to extract the most frequent names from the actors_list column. I found this code; do you have a better suggestion, especially for big data?
import pandas as pd
df = pd.read_table(r'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv', sep=',')
df.actors_list.str.replace("(u\'|[\[\]]|\')",'').str.lower().str.split(',',expand=True).stack().value_counts()
Expected output (for this data):
robert de niro 13
tom hanks 12
clint eastwood 11
johnny depp 10
al pacino 10
james stewart 9
By my tests, it would be much faster to do the regex cleanup after counting.
from itertools import chain
import re
p = re.compile("""^u['"](.*)['"]$""")
ser = pd.Series(list(chain.from_iterable(
    x.title().split(', ') for x in df.actors_list.str[1:-1]))).value_counts()
ser.index = [p.sub(r"\1", x) for x in ser.index.tolist()]
ser.head()
Robert De Niro 18
Brad Pitt 14
Clint Eastwood 14
Tom Hanks 14
Al Pacino 13
dtype: int64
It's always better to go with plain Python than to depend on pandas here, since pandas consumes a huge amount of memory if the lists are large. If one list has size 1000, all the shorter lists are padded with NaNs when you use expand=True, which is a waste of memory. Try this instead:
df = pd.concat([df]*1000) # For the sake of large df.
%%timeit
df.actors_list.str.replace("(u\'|[\[\]]|\')",'').str.lower().str.split(',',expand=True).stack().value_counts()
10 loops, best of 3: 65.9 ms per loop
%%timeit
df['actors_list'] = df['actors_list'].str.strip('[]').str.replace(', ',',').str.split(',')
10 loops, best of 3: 24.1 ms per loop
%%timeit
words = {}
for i in df['actors_list']:
    for w in i:
        if w in words:
            words[w] += 1
        else:
            words[w] = 1
100 loops, best of 3: 5.44 ms per loop
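As a side note, collections.Counter does the same counting as the dict loop above a little more idiomatically; a sketch, assuming actors_list already holds real Python lists (as after the strip/split step):
from collections import Counter
from itertools import chain
# One pass over every name in every row's list
counts = Counter(chain.from_iterable(df['actors_list']))
print(counts.most_common(5))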
I would use ast to convert the list-like strings into actual lists:
import ast
df.actors_list = df.actors_list.apply(ast.literal_eval)
pd.DataFrame(df.actors_list.tolist()).melt().value.value_counts()
According to this code I got the chart below (image not preserved here), where coldspeed's code is wen2(), Dark's code is wen4(), my code is wen1(), and W-B's code is wen3().
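On pandas 0.25 or newer (a version assumption), Series.explode() can replace the DataFrame/melt reshape; a minimal sketch:
import ast
# Parse the stringified lists, flatten to one name per row, then count
actors = df.actors_list.apply(ast.literal_eval)
print(actors.explode().value_counts().head())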

how to slice a csv file using python's pandas

I have this csv file:
DATE RELEASE 10GB 100GB 200GB 400GB 600GB 800GB 1000GB
5/5/16 2.67 0.36 4.18 8.54 18 27 38 46
5/5/16 2.68 0.5 4.24 9.24 18 27 37 46
5/6/16 2.69 0.32 4.3 9.18 19 29 37 46
5/6/16 2.7 0.35 4.3 9.3 19 28 37 46
5/6/16 2.71 0.3 4.18 8.18 16 24 33 41
I need to calculate the difference in each column (10GB ~ 1000GB) between release 2.71 and release 2.70, i.e. the last row minus the row above it.
My code to access each desired row is row1 = df[df.RELEASE == 2.70] and row2 = df[df.RELEASE == 2.71].
My issue is that I do not know how to access each element. I tried putting row1 and row2 into lists with list(row1) and list(row2), but that only prints the column titles rather than the cell values.
My question is: how do I access each element of the desired rows, so I can calculate "0.3 - 0.35"? Thanks for helping!
If it's really the last two rows you're chasing, try negative positional indexing with iloc (loc won't accept -1 against a default integer index):
df.iloc[-1] - df.iloc[-2]
I'm on a phone so haven't run the code, but it should get you closer.
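If you would rather select the rows by release number than by position, a sketch using the column names from the question:
# Pull each release's row out as a Series (squeeze: 1-row frame -> Series)
r271 = df[df.RELEASE == 2.71].squeeze()
r270 = df[df.RELEASE == 2.70].squeeze()
# Elementwise difference across the size columns
cols = ['10GB', '100GB', '200GB', '400GB', '600GB', '800GB', '1000GB']
print(r271[cols] - r270[cols])  # 10GB entry: 0.3 - 0.35 = -0.05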

Selecting min values in SQL table when column < 0, and max values when column > 0

I have a SQL table like this:
Ticker Return Shares
AGJ 2.20 1265
ATA 1.78 698
ARS 9.78 10939
ARE -7.51 -26389
AIM 0.91 1758
ABT 10.02 -5893
AC -5.73 -2548
ATD 6.51 7850
AP 1.98 256
ALA -9.58 8524
So essentially, a table of stocks I've longed/shorted.
I want to find the top 4 best performers in this table: the shorts (shares < 0) with the lowest (most negative) return, and the longs (shares > 0) with the highest return.
Essentially, returning this:
Ticker Return Shares
ARS 9.78 10939
ARE -7.51 -26389
AC -5.73 -2548
ATD 6.51 7850
How would I be able to write the query that lets me do this?
Or, if it's easier, if there are any pandas functions that would do the same thing if I turned this table into a pandas dataframe.
Something like this:
select top (4) t.*
from t
order by (case when shares < 0 then - [return] else [return] end) desc;
Pandas solution:
In [134]: df.loc[(np.sign(df.Shares)*df.Return).nlargest(4).index]
Out[134]:
Ticker Return Shares
2 ARS 9.78 10939
3 ARE -7.51 -26389
7 ATD 6.51 7850
6 AC -5.73 -2548
Explanation:
In [137]: (np.sign(df.Shares)*df.Return)
Out[137]:
0 2.20
1 1.78
2 9.78
3 7.51
4 0.91
5 -10.02
6 5.73
7 6.51
8 1.98
9 -9.58
dtype: float64
In [138]: (np.sign(df.Shares)*df.Return).nlargest(4)
Out[138]:
2 9.78
3 7.51
7 6.51
6 5.73
dtype: float64
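An equivalent pandas spelling that keeps the signed-return helper column explicit (same calls as above, just chained):
# Attach the sign-adjusted return, take the 4 largest, drop the helper
top4 = (df.assign(perf=np.sign(df['Shares']) * df['Return'])
          .nlargest(4, 'perf')
          .drop(columns='perf'))
print(top4)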
