Getting a list of the range of 2 pandas columns - python

I have the following DataFrame (reformatted a bit):
f_name l_name n f_bought l_bought
0 Abraham Livingston 24 1164 1187
1 John Brown 4 1188 1191
2 Samuel Barret 16 1192 1207
3 Nathan Blodget 4 1208 1212
4 Bobby Abraham 1 1212 1212
I want to create a column, bought, that is the list range(df['f_bought'], df['l_bought']).
I've tried:
def getRange(l1, l2):
    r = list(range(l1, l2))

df.apply(lambda index: getRange(df['f_bought'], df['l_bought']), axis=1)
but it results in a TypeError:
"cannot convert the series to <type 'int'>", u'occurred at index 0'
I've tried a df.info(), and both columns are type int64.
I'm wondering if I should use something like df.loc[] or similar? Or something else entirely?

You should be able to do this using apply, which applies a function to every row or every column of a DataFrame.
def bought_range(row):
    return range(row.f_bought, row.l_bought)
df['bought_range'] = df.apply(bought_range, axis=1)
Which results in:
f_name l_name n f_bought l_bought \
0 Abraham Livingston 24 1164 1187
1 John Brown 4 1188 1191
2 Samuel Barret 16 1192 1207
3 Nathan Blodget 4 1208 1212
4 Bobby Abraham 1 1212 1212
bought_range
0 [1164, 1165, 1166, 1167, 1168, 1169, 1170, 117...
1 [1188, 1189, 1190]
2 [1192, 1193, 1194, 1195, 1196, 1197, 1198, 119...
3 [1208, 1209, 1210, 1211]
4 []
One word of warning is that Python's range doesn't include the upper limit:
In [1]: range(3, 6)
Out[1]: [3, 4, 5]
It's not hard to deal with (return range(row.f_bought, row.l_bought + 1)) but does need taking into account.
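For example, a minimal sketch of the inclusive version (same column names as above):

def bought_range(row):
    # + 1 so that l_bought itself is included
    return list(range(row.f_bought, row.l_bought + 1))

df['bought_range'] = df.apply(bought_range, axis=1)

Wrapping the result in list() also keeps the behaviour consistent on Python 3, where range returns a lazy object rather than a list.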


Get length of values in pandas dataframe column

I'm trying to get the length of each zipCd value in the dataframe mentioned below. When I run the code below I get 958 for every record. I'm expecting to get something more like '4'. Does anyone see what the issue is?
Code:
zipDfCopy['zipCd'].str.len()
Data:
print zipDfCopy[1:5]
Zip Code Place Name State State Abbreviation County \
1 544 Holtsville New York NY Suffolk
2 1001 Agawam Massachusetts MA Hampden
3 1002 Amherst Massachusetts MA Hampshire
4 1003 Amherst Massachusetts MA Hampshire
Latitude Longitude zipCd
1 40.8154 -73.0451 0 501\n1 544\n2 1001...
2 42.0702 -72.6227 0 501\n1 544\n2 1001...
3 42.3671 -72.4646 0 501\n1 544\n2 1001...
4 42.3919 -72.5248 0 501\n1 544\n2 1001...
One way is to convert to string and use pd.Series.map with the len built-in.
pd.Series.str provides vectorized string functions, while pd.Series.astype changes the column type.
import pandas as pd
df = pd.DataFrame({'ZipCode': [341, 4624, 536, 123, 462, 4642]})
df['ZipLen'] = df['ZipCode'].astype(str).map(len)
# ZipCode ZipLen
# 0 341 3
# 1 4624 4
# 2 536 3
# 3 123 3
# 4 462 3
# 5 4642 4
An arithmetic alternative for positive integers is np.log10 (this assumes import numpy as np, and fails for zero or negative values):
df['ZipLen'] = np.floor(np.log10(df['ZipCode'].values)).astype(int) + 1
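Equivalently, the conversion and the length can be chained in one vectorized call via the .str accessor (a short sketch on the same example frame):

df['ZipLen'] = df['ZipCode'].astype(str).str.len()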

python error when finding count of cells where value was found

I have the code below on toy data, and it works the way I want. The last two columns record how many times the value in column Jan was found in column URL, and in how many distinct rows the value in column Jan was found in column URL.
sales = [{'account': '3', 'Jan': 'xxx', 'Feb': '200 .jones', 'URL': 'ea2018-001.pdf try bbbbb why try'},
{'account': '1', 'Jan': 'try', 'Feb': '210', 'URL': ''},
{'account': '2', 'Jan': 'bbbbb', 'Feb': '90', 'URL': 'ea2017-104.pdf bbbbb cc for why try' }]
df = pd.DataFrame(sales)
df
df['found_in_column'] = df['Jan'].apply(lambda x: ''.join(df['URL'].tolist()).count(x))
df['distinct_finds'] = df['Jan'].apply(lambda x: sum(df['URL'].str.contains(x)))
Why does the same code fail in the last case? How could I change my code to avoid the error? In my last example there are special characters in the first column, and I felt they were causing the problem. But when I look at the rows with index 3 and 4, they have special characters too and the code runs fine.
answer2=answer[['Value','non_repeat_pdf']].iloc[0:11]
print(answer2)
Value non_repeat_pdf
0 effect\nive Initials: __\nDL_ -1- Date: __\n8/14/2017\n...
1 closing ####
2 executing ####
3 order, ####
4 waives: ####
5 right ####
6 notice ####
7 intention ####
8 prohibit ####
9 further ####
10 participation ####
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
Out[220]:
0 1
1 0
2 1
3 0
4 1
5 1
6 0
7 0
8 1
9 0
10 0
Name: Value, dtype: int64
answer2=answer[['Value','non_repeat_pdf']].iloc[10:11]
print(answer2)
Value non_repeat_pdf
10 participation ####
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
Out[212]:
10 0
Name: Value, dtype: int64
answer2=answer[['Value','non_repeat_pdf']].iloc[11:12]
print(answer2)
Value non_repeat_pdf
11 1818(e); ####
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
Traceback (most recent call last):
File "<ipython-input-215-2df7f4b2de41>", line 1, in <module>
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py", line 2355, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/src\inference.pyx", line 1574, in pandas._libs.lib.map_infer
File "<ipython-input-215-2df7f4b2de41>", line 1, in <lambda>
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x)))
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\strings.py", line 1562, in contains
regex=regex)
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\strings.py", line 254, in str_contains
stacklevel=3)
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\warnings.py", line 99, in _showwarnmsg
msg.file, msg.line)
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1069, in _showwarning
file.write(formatWarning(message, category, filename, lineno, line))
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\utils.py", line 69, in formatWarning
file = filename.replace("/", "\\").rsplit("\\", 1)[1] # find the file name
IndexError: list index out of range
Update
I modified my code and removed all special characters from the Value column. I am still getting the error... What could be wrong?
Even with the error, the new column gets added to my answer2 dataframe.
answer2=answer[['Value','non_repeat_pdf']]
print(answer2)
Value non_repeat_pdf
0 law Initials: __\nDL_ -1- Date: __\n8/14/2017\n...
1 concerned
2 rights
3 c
4 violate
5 8
6 agreement
7 voting
8 previously
9 supervisory
10 its
11 exercise
12 occs
13 entities
14 those
15 approved
16 1818h2
17 9
18 are
19 manner
20 their
21 affairs
22 b
23 solicit
24 procure
25 transfer
26 attempt
27 extraneous
28 modification
29 vote
... ...
1552 closing
1553 heavily
1554 pm
1555 throughout
1556 half
1557 window
1558 sixtysecond
1559 activity
1560 sampling
1561 using
1562 hour
1563 violated
1564 euro
1565 rates
1566 derivatives
1567 portfolios
1568 valuation
1569 parties
1570 numerous
1571 they
1572 reference
1573 because
1574 us
1575 important
1576 moment
1577 snapshot
1578 cet
1579 215
1580 finance
1581 supervision
[1582 rows x 2 columns]
answer2['found_in_all_PDF'] = answer2['Value'].apply(lambda x: ''.join(answer2['non_repeat_pdf'].tolist()).count(x))
Traceback (most recent call last):
File "<ipython-input-298-4dc80361895c>", line 1, in <module>
answer2['found_in_all_PDF'] = answer2['Value'].apply(lambda x: ''.join(answer2['non_repeat_pdf'].tolist()).count(x))
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 2331, in __setitem__
self._set_item(key, value)
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 2404, in _set_item
self._check_setitem_copy()
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 1873, in _check_setitem_copy
warnings.warn(t, SettingWithCopyWarning, stacklevel=stacklevel)
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\warnings.py", line 99, in _showwarnmsg
msg.file, msg.line)
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1069, in _showwarning
file.write(formatWarning(message, category, filename, lineno, line))
File "C:\Users\\AppData\Local\Continuum\anaconda3\lib\site-packages\PyPDF2\utils.py", line 69, in formatWarning
file = filename.replace("/", "\\").rsplit("\\", 1)[1] # find the file name
IndexError: list index out of range
Update 2
The code below works:
answer2=answer[['Value','non_repeat_pdf']]
xyz= answer2['Value'].apply(lambda x: ''.join(answer2['non_repeat_pdf'].tolist()).count(x))
xyz=xyz.to_frame()
xyz.columns=['found_in_all_PDF']
pd.concat([answer2, xyz], axis=1)
Out[305]:
Value non_repeat_pdf \
0 law Initials: __\nDL_ -1- Date: __\n8/14/2017\n...
1 concerned
2 rights
3 c
4 violate
5 8
6 agreement
7 voting
8 previously
9 supervisory
10 its
11 exercise
12 occs
13 entities
14 those
15 approved
16 1818h2
17 9
18 are
19 manner
20 their
21 affairs
22 b
23 solicit
24 procure
25 transfer
26 attempt
27 extraneous
28 modification
29 vote
... ...
1552 closing
1553 heavily
1554 pm
1555 throughout
1556 half
1557 window
1558 sixtysecond
1559 activity
1560 sampling
1561 using
1562 hour
1563 violated
1564 euro
1565 rates
1566 derivatives
1567 portfolios
1568 valuation
1569 parties
1570 numerous
1571 they
1572 reference
1573 because
1574 us
1575 important
1576 moment
1577 snapshot
1578 cet
1579 215
1580 finance
1581 supervision
found_in_all_PDF
0 6
1 1
2 4
3 1036
4 9
5 93
6 4
7 2
8 1
9 2
10 6
11 1
12 0
13 1
14 3
15 1
16 0
17 25
18 20
19 3
20 14
21 4
22 358
23 2
24 1
25 2
26 6
27 1
28 1
29 3
...
1552 3
1553 2
1554 0
1555 5
1556 2
1557 3
1558 0
1559 2
1560 1
1561 5
1562 2
1563 7
1564 8
1565 3
1566 0
1567 1
1568 1
1569 4
1570 1
1571 9
1572 2
1573 2
1574 96
1575 1
1576 1
1577 1
1578 0
1579 0
1580 1
1581 0
[1582 rows x 3 columns]
Unfortunately I can't reproduce exactly the same error in my environment, but what I do see is a warning about incorrect regex usage. Your string was interpreted as a capturing regular expression because of the parentheses in "1818(e);" (str.contains treats its pattern as a regex by default). Try str.contains with regex=False.
As for the traceback in your update: pandas is raising a SettingWithCopyWarning there (answer2 is a slice of answer), and the crash happens inside PyPDF2's overridden warning formatter while that warning is being displayed. Building the column separately and joining with pd.concat, as in your update 2, avoids triggering the warning.
answer2 =pd.DataFrame({'Value': {11: '1818(e);'}, 'non_repeat_pdf': {11: '####'}})
answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(x,regex=False)))
Output:
11 0
Name: Value, dtype: int64
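If regex matching is still wanted for the other rows, an alternative sketch (not from the original answer) is to escape the metacharacters instead of disabling regex:

import re

answer2['Value'].apply(lambda x: sum(answer2['non_repeat_pdf'].str.contains(re.escape(x))))

re.escape backslash-escapes every regex metacharacter, so '1818(e);' is matched literally while ordinary words behave as before.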

pandas dataframe sort on column raises keyerror on index

I have the following dataframe, df:
peaklatency snr
0 52.99 0.0
1 54.15 62.000000
2 54.12 82.000000
3 54.64 52.000000
4 54.57 42.000000
5 54.13 72.000000
I'm attempting to sort this by snr:
df.sort_values(df.snr)
but this raises
_convert_to_indexer(self, obj, axis, is_setter)
1208 mask = check == -1
1209 if mask.any():
-> 1210 raise KeyError('%s not in index' % objarr[mask])
1211
1212 return _values_from_object(indexer)
KeyError: '[ inf 62. 82. 52. 42. 72.] not in index'
I am not explicitly setting an index on this DataFrame; it's built up in a plain loop:
import pandas as pd

d = []
for run in runs:
    d.append({
        'snr': run.periphery.snr.snr,
        'peaklatency': (run.brainstem.wave5.wave5.argmax() / 100e3) * 1e3
    })
df = pd.DataFrame(d)
The by keyword to sort_values expects column names, not the actual Series itself. So, you'd want:
In [23]: df.sort_values('snr')
Out[23]:
peaklatency snr
0 52.99 0.0
4 54.57 42.0
3 54.64 52.0
1 54.15 62.0
5 54.13 72.0
2 54.12 82.0
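As an aside, by also accepts a list of column names when you need tie-breaking; a hypothetical extension of the same frame:

df.sort_values(['snr', 'peaklatency'], ascending=[True, False])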

Conditional function on multiple rows

I have a csv file like so:
Landform Number Name Class
0 Deltaic Plain 912 Lx NaN
1 Hummock and Swale 912 Lx NaN
2 Sand Dunes 912 Lx NaN
3 Hummock and Swale 939 Woodbury NaN
4 Sand Dunes 939 Woodbury NaN
and when Landform contains Deltaic Plain, Hummock and Swale, and Sand Dunes for a particular Name, I want to assign a value of 1 to Class.
When Landform contains only Hummock and Swale and Sand Dunes, I want to assign a value of 2 to Class.
My desired output is:
Landform Number Name Class
0 Deltaic Plain 912 Lx 1
1 Hummock and Swale 912 Lx 1
2 Sand Dunes 912 Lx 1
3 Hummock and Swale 939 Woodbury 2
4 Sand Dunes 939 Woodbury 2
I know how to do this for just 1 row like this:
def f(x):
    if x['Landform'] == 'Hummock and Swale':
        return '1'
    else:
        return '2'

df['Class'] = df.apply(f, axis=1)
but I am not sure how to group by Name and then create the conditional functions based on numerous rows.
The idea is to group on your Number column, and apply a function that looks at all the landforms in that group and returns an appropriate class. Here's an example:
def determineClass(landforms):
    if all(form in landforms.values for form in ('Deltaic Plain', 'Hummock and Swale', 'Sand Dunes')):
        return 1
    elif all(form in landforms.values for form in ('Hummock and Swale', 'Sand Dunes')):
        return 2
    # etc.
    else:
        # return "default" class
        return 0
>>> df.groupby('Number').Landform.apply(determineClass)
Number
912 1
939 2
Name: Landform, dtype: int64
If you want to assign the values back into the Class column, just use map:
>>> classes = df.groupby('Number').Landform.apply(determineClass)
>>> df['Class'] = df.Number.map(classes)
>>> df
Landform Number Name Class
0 Deltaic Plain 912 Lx 1
1 Hummock and Swale 912 Lx 1
2 Sand Dunes 912 Lx 1
3 Hummock and Swale 939 Woodbury 2
4 Sand Dunes 939 Woodbury 2
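A one-step alternative (a sketch, not part of the original answer) is groupby().transform, which broadcasts each group's scalar result back onto its rows directly:

df['Class'] = df.groupby('Number')['Landform'].transform(determineClass)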

Error when using pandas dataframe map function in ipython notebook

I'm just starting out with Python and getting stuck on something while playing with the Kaggle Titanic data.
https://www.kaggle.com/c/titanic/data
Here's what I am typing in an ipython notebook (train.csv comes from the titanic data from the kaggle link above):
import pandas as pd
df = pd.read_csv("C:/fakepath/titanic/data/train.csv")
I then continue with this to check if there's any bad data in the 'Sex' column:
df['Sex'].value_counts()
Which returns:
male 577
female 314
dtype: int64
df['Gender'] = df['Sex'].map( {'male': 1, 'female': 0} ).astype(int)
This doesn't produce any errors. To verify that it created a new column called 'Gender' with integer values:
df
which returns:
# PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Gender
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C 0
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S 0
...success, the Gender column is appended to the end and is 0 for female, 1 for male. Now, I create a new pandas dataframe which is a subset of the df dataframe.
df2 = df[ ['Survived', 'Pclass', 'Age', 'Gender', 'Embarked'] ]
df2
which returns:
Survived Pclass Age Gender Embarked
0 0 3 22 1 S
1 1 1 38 0 C
2 1 3 26 0 S
3 1 1 35 0 S
4 0 3 35 1 S
5 0 3 NaN 1 Q
df2['Embarked'].value_counts()
...shows that there are 3 unique values (S, C, Q):
S 644
C 168
Q 77
dtype: int64
However, when I try to execute what I think is the same type of operation as when I converted male/female to 1/0, I get an error:
df2['Embarked_int'] = df2['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2}).astype(int)
returns:
ValueError Traceback (most recent call last)
<ipython-input-29-294c08f2fc80> in <module>()
----> 1 df2['Embarked_int'] = df2['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2}).astype(int)
C:\Anaconda\lib\site-packages\pandas\core\generic.pyc in astype(self, dtype, copy, raise_on_error)
2212
2213 mgr = self._data.astype(
-> 2214 dtype=dtype, copy=copy, raise_on_error=raise_on_error)
2215 return self._constructor(mgr).__finalize__(self)
2216
C:\Anaconda\lib\site-packages\pandas\core\internals.pyc in astype(self, dtype, **kwargs)
2500
2501 def astype(self, dtype, **kwargs):
-> 2502 return self.apply('astype', dtype=dtype, **kwargs)
2503
2504 def convert(self, **kwargs):
C:\Anaconda\lib\site-packages\pandas\core\internals.pyc in apply(self, f, axes, filter, do_integrity_check, **kwargs)
2455 copy=align_copy)
2456
-> 2457 applied = getattr(b, f)(**kwargs)
2458
2459 if isinstance(applied, list):
C:\Anaconda\lib\site-packages\pandas\core\internals.pyc in astype(self, dtype, copy, raise_on_error, values)
369 def astype(self, dtype, copy=False, raise_on_error=True, values=None):
370 return self._astype(dtype, copy=copy, raise_on_error=raise_on_error,
--> 371 values=values)
372
373 def _astype(self, dtype, copy=False, raise_on_error=True, values=None,
C:\Anaconda\lib\site-packages\pandas\core\internals.pyc in _astype(self, dtype, copy, raise_on_error, values, klass)
399 if values is None:
400 # _astype_nansafe works fine with 1-d only
--> 401 values = com._astype_nansafe(self.values.ravel(), dtype, copy=True)
402 values = values.reshape(self.values.shape)
403 newb = make_block(values,
C:\Anaconda\lib\site-packages\pandas\core\common.pyc in _astype_nansafe(arr, dtype, copy)
2616
2617 if np.isnan(arr).any():
-> 2618 raise ValueError('Cannot convert NA to integer')
2619 elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
2620 # work around NumPy brokenness, #1987
ValueError: Cannot convert NA to integer
Any idea why I get this error on the 2nd use of the map function but not the first? There are no NaN values in the Embarked column per value_counts(). I'm guessing it's a noob problem :)
By default value_counts does not count NaN values; you can change this with df['Embarked'].value_counts(dropna=False).
I looked at your value_counts for the Gender column (577 + 314 = 891) versus the Embarked column (644 + 168 + 77 = 889); they differ by 2, which means you must have 2 NaN values.
So either drop those rows first (using dropna) or fill them with some desired value using fillna.
Also, the astype(int) is redundant, as you are mapping to an int anyway.
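For example, a minimal sketch of the fillna route (the fill value 'S' is just an illustrative choice, being the most common port here):

df2['Embarked_int'] = df2['Embarked'].fillna('S').map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

With the NaN values filled first, the astype(int) no longer fails (though, as noted, it is unnecessary).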
I just came across this problem on the same dataset. Removing the astype(int) solved the whole problem.
