KeyError when using .pivot in Python pandas

I have looked at lots of pivot table related questions and found none that addressed this specific problem. I have a data frame like this:
                      Tumor Volume (mm3)
Drug       Timepoint
Capomulin  0                   45.000000
           5                   44.266086
           10                  43.084291
           15                  42.064317
           20                  40.716325
...        ...                       ...
Zoniferol  25                  55.432935
           30                  57.713531
           35                  60.089372
           40                  62.916692
           45                  65.960888
I am trying to pivot the data so that the name of the drug becomes the column headings, timepoint becomes the new index, and the tumor volume is the value. Everything I have looked up online tells me to use:
mean_tumor_volume_gp.pivot(index="Timepoint",
                           columns="Drug",
                           values="Tumor Volume (mm3)")
However, when I run this cell, I get the error message:
KeyError Traceback (most recent call last)
<ipython-input-15-788b92ba981e> in <module>
2 mean_tumor_volume_gp.pivot(index = "Timepoint",
3 columns = "Drug",
----> 4 values = "Tumor Volume (mm3)")
5
KeyError: 'Timepoint'
How is this a key error? The key "Timepoint" is a column in the original DF.
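A likely cause, assuming mean_tumor_volume_gp came from a groupby(...).mean() (the variable name and the displayed output suggest it did): after a groupby, the grouping keys live in the index, not in the columns, so .pivot() can't find them. A minimal sketch with made-up values:

```python
import pandas as pd

# Toy data mirroring the question's layout (values assumed for illustration)
df = pd.DataFrame({
    "Drug": ["Capomulin", "Capomulin", "Zoniferol", "Zoniferol"],
    "Timepoint": [0, 5, 0, 5],
    "Tumor Volume (mm3)": [45.0, 44.27, 55.43, 57.71],
})

# After groupby(["Drug", "Timepoint"]).mean(), "Timepoint" is an index
# level, not a column, so calling .pivot() raises KeyError: 'Timepoint'.
mean_tumor_volume_gp = df.groupby(["Drug", "Timepoint"]).mean()

# Fix 1: move the index levels back into columns, then pivot
pivoted = mean_tumor_volume_gp.reset_index().pivot(
    index="Timepoint", columns="Drug", values="Tumor Volume (mm3)")

# Fix 2: unstack the "Drug" index level directly
pivoted2 = mean_tumor_volume_gp["Tumor Volume (mm3)"].unstack("Drug")
```

Both fixes yield the same frame: drugs as columns, timepoints as the index.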

Related

Print the total population for a specific category within the Region column

I have a Population column with numbers and a Region column with locations. I'm only using pandas. How would I go about finding the total population of a specific location (Wellington) within the Region column?
Place = [data['Region'] == 'Wellington']
Place[data['Population']]
An error came up:
TypeError Traceback (most recent call last)
Input In [70], in <cell line: 4>()
1 #Q1.e
3 Place = [data['Region']=='Wellington']
----> 4 Place[data['Population']]
TypeError: list indices must be integers or slices, not Series
Try this:
data_groups = data.groupby("Region")['Population'].sum()
Output:
data_groups
Region
Northland 4750
Wellington 7580
WestCoast 1550
If you want to call some specific region, you can do:
data_groups.loc['WestCoast'] # 1550
Use DataFrame.loc with sum:
Place = data.loc[data['Region'] == 'Wellington', 'Population'].sum()
print (Place)
7190
Another idea is convert Region to index, select by Series.loc and then sum:
Place = data.set_index('Region')['Population'].loc['Wellington'].sum()
print (Place)
7190
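For reference, the root cause and the fix fit in a few lines; the data here is made up to match the shapes and totals in the question:

```python
import pandas as pd

# Hypothetical data in the shape the question describes
data = pd.DataFrame({"Region": ["Wellington", "Northland", "Wellington"],
                     "Population": [4000, 4750, 3190]})

# The question's two-liner wraps the boolean Series in a Python list,
# so indexing that list with another Series raises the TypeError.
# A boolean mask applied with .loc does what was intended:
total = data.loc[data["Region"] == "Wellington", "Population"].sum()
```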

Having trouble with &lt;class 'pandas.core.indexing._AtIndexer'&gt;

I'm working on a ML project to predict answer times in stack overflow based on tags. Sample data:
Unnamed: 0 qid i qs qt tags qvc qac aid j as at
0 1 563355 62701.0 0 1235000081 php,error,gd,image-processing 220 2 563372 67183.0 2 1235000501
1 2 563355 62701.0 0 1235000081 php,error,gd,image-processing 220 2 563374 66554.0 0 1235000551
2 3 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563358 15842.0 3 1235000177
3 4 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563413 893.0 18 1235001545
4 5 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563454 11649.0 4 1235002457
I'm stuck at the data cleaning process. I intend to create a new column named 'time_taken' which stores the difference between the at and qt columns.
Code:
import pandas as pd
import numpy as np

df = pd.read_csv("answers.csv")
df['time_taken'] = 0
print(type(df.time_taken))

for i in range(0, 263541):
    val = df.qt[i]
    qtval = val.item()
    val = df.at[i]
    atval = val.item()
    df.time_taken[i] = qtval - atval
I'm getting this error:
Traceback (most recent call last):
File "<ipython-input-39-9384be9e5531>", line 1, in <module>
val = df.at[0]
File "D:\Softwares\Anaconda\lib\site-packages\pandas\core\indexing.py", line 2080, in __getitem__
return super().__getitem__(key)
File "D:\Softwares\Anaconda\lib\site-packages\pandas\core\indexing.py", line 2027, in __getitem__
return self.obj._get_value(*key, takeable=self._takeable)
TypeError: _get_value() missing 1 required positional argument: 'col'
The problem lies in the indexing of df.at. The types are:
df.at -> <class 'pandas.core.indexing._AtIndexer'>
df.qt -> <class 'pandas.core.series.Series'>
I'm an absolute beginner in data science and do not have enough experience with pandas and numpy.
There is, to put it mildly, an easier way to do this:
df['time_taken'] = df['at'] - df['qt']
The _AtIndexer issue comes up because .at is a built-in pandas indexer, so df.at resolves to that indexer rather than to your "at" column. Avoid naming columns after existing pandas attributes for this reason; you can work around it by indexing with df['at'] instead of df.at.
Besides that, this operation, if I'm understanding it correctly, can be done in one short line instead of a long for loop.
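A small self-contained sketch of the name clash, with a toy frame standing in for the question's data:

```python
import pandas as pd

# Toy frame with a column literally named "at", like the question's data
df = pd.DataFrame({"qt": [100, 200], "at": [150, 260]})

# df.at resolves to pandas' .at indexer, not the "at" column,
# so use bracket indexing for that column instead:
df["time_taken"] = df["at"] - df["qt"]
```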

What are "pairs" when comparing the records of each record pair in recordlinkage?

I have a set of real estate ad data. Several of the rows are about the same property, so it's full of near-duplicates that aren't exactly the same. It looks like this:
ID URL CRAWL_SOURCE PROPERTY_TYPE NEW_BUILD DESCRIPTION IMAGES SURFACE LAND_SURFACE BALCONY_SURFACE ... DEALER_NAME DEALER_TYPE CITY_ID CITY ZIP_CODE DEPT_CODE PUBLICATION_START_DATE PUBLICATION_END_DATE LAST_CRAWL_DATE LAST_PRICE_DECREASE_DATE
0 22c05930-0eb5-11e7-b53d-bbead8ba43fe http://www.avendrealouer.fr/location/levallois... A_VENDRE_A_LOUER APARTMENT False Au rez de chaussée d'un bel immeuble récent,... ["https://cf-medias.avendrealouer.fr/image/_87... 72.0 NaN NaN ... Lamirand Et Associes AGENCY 54178039 Levallois-Perret 92300.0 92 2017-03-22T04:07:56.095 NaN 2017-04-21T18:52:35.733 NaN
1 8d092fa0-bb99-11e8-a7c9-852783b5a69d https://www.bienici.com/annonce/ag440414-16547... BIEN_ICI APARTMENT False Je vous propose un appartement dans la rue Col... ["http://photos.ubiflow.net/440414/165474561/p... 48.0 NaN NaN ... Proprietes Privees MANDATARY 54178039 Levallois-Perret 92300.0 92 2018-09-18T11:04:44.461 NaN 2019-06-06T10:08:10.89 2018-09-25
...
I want to find records in the dataset belonging to the same entity with recordlinkage. So I read the docs and mimicked them:
indexer = recordlinkage.Index()
indexer.full()
candidate_links = indexer.index(df)
print (len(df), len(candidate_links))
2164 2340366
Each record pair being a candidate match, to classify the candidate record pairs into matches and non-matches, I want to compare the records on all attributes both records have in common. The recordlinkage module has a class named Compare. This class is used to compare the records. The following code shows how I compared attributes :
compare_cl = recordlinkage.Compare()
compare_cl.exact('SURFACE', 'SURFACE', label='SURFACE')
features = compare_cl.compute(pairs, df)
However it gives me back :
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-51-1e55ea540dbd> in <module>
9 #compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')
10
---> 11 features = compare_cl.compute(pairs, df)
NameError: name 'pairs' is not defined
And I can't find what pairs are in the docs ...
Try it with candidate_links, the MultiIndex of record pairs you built with indexer.index(df):
features = compare_cl.compute(candidate_links, df)
From the docs (https://recordlinkage.readthedocs.io/en/latest/ref-compare.html):
compute(pairs, x, x_link=None)
Compare the records of each record pair.
Calling this method starts the comparing of records.
Parameters:
pairs (pandas.MultiIndex) – A pandas MultiIndex with the record pairs to compare. The indices in the MultiIndex are indices of the DataFrame(s) to link.
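In other words, pairs is nothing exotic: it is a pandas MultiIndex of row-label pairs, which is exactly what candidate_links already is. A pandas-only sketch (no recordlinkage required) of what a full index builds conceptually:

```python
from itertools import combinations

import pandas as pd

# Conceptually, indexer.full() enumerates every unordered pair of row
# labels and packs them into a two-level pandas MultiIndex.
labels = range(2164)  # the question's DataFrame has 2164 rows
candidate_links = pd.MultiIndex.from_tuples(list(combinations(labels, 2)))

# A full index yields n*(n-1)/2 pairs, matching the 2340366 printed above
n_pairs = len(candidate_links)
```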

How to fill rows automatically in pandas, from the content found in a column?

In Python 3 and pandas, I have a dataframe with dozens of columns and rows about food characteristics. Below is a summary:
alimentos = pd.read_csv("alimentos.csv",sep=',',encoding = 'utf-8')
alimentos.reset_index()
index alimento calorias
0 0 iogurte 40
1 1 sardinha 30
2 2 manteiga 50
3 3 maçã 10
4 4 milho 10
The column "alimento" (food) has the lines "iogurte", "sardinha", "manteiga", "maçã" and "milho", which are food names.
I need to create a new column in this dataframe which will say what kind of food each one is. I named it "classificacao":
alimentos['classificacao'] = ""
alimentos.reset_index()
index alimento calorias classificacao
0 0 iogurte 40
1 1 sardinha 30
2 2 manteiga 50
3 3 maçã 10
4 4 milho 10
Depending on the content found in the "alimento" column, I want to automatically fill the rows of the "classificacao" column.
For example, when finding "iogurte", fill in "laticinio"; "sardinha" -> "peixe"; "manteiga" -> "gordura animal"; "maçã" -> "fruta"; and "milho" -> "cereal".
Please, is there a way to automatically fill the rows when I find these strings?
If you have a mapping of all the possible values in the "alimento" column, you can just create a dictionary and use .map(d), as shown below:
df = pd.DataFrame({'alimento': ['iogurte', 'sardinha', 'manteiga', 'maçã', 'milho'],
                   'calorias': range(10, 60, 10)})
d = {"iogurte":"laticinio", "sardinha":"peixe", "manteiga":"gordura animal", "maçã":"fruta", "milho": "cereal"}
df['classificacao'] = df['alimento'].map(d)
However, in real life we often can't map everything in a dict (outliers that occur once in a blue moon, faulty inputs, etc.), in which case the above would return NaN in the "classificacao" column. That could cause issues, so consider setting a default value such as "Other" or "Unknown". To do that, just append .fillna("Other") after .map(d).
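Putting both pieces together, the complete snippet with the fallback looks like this:

```python
import pandas as pd

df = pd.DataFrame({'alimento': ['iogurte', 'sardinha', 'manteiga', 'maçã', 'milho'],
                   'calorias': range(10, 60, 10)})
d = {"iogurte": "laticinio", "sardinha": "peixe", "manteiga": "gordura animal",
     "maçã": "fruta", "milho": "cereal"}

# Unmapped food names would become NaN; fillna supplies a default label
df['classificacao'] = df['alimento'].map(d).fillna("Other")
```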

Pandas ExcelFile.parse() reading file in as dict instead of dataframe

I am new to python and even newer to pandas, but relatively well versed in R. I am using Anaconda, with Python 3.5 and pandas 0.18.1. I am trying to read in an excel file as a dataframe. The file admittedly is pretty... ugly. There is a lot of empty space, missing headers, etc. (I am not sure if this is the source of any issues)
I create the file object, then find the appropriate sheet, then try to read that sheet as a dataframe:
xl = pd.ExcelFile(allFiles[i])
sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()]
df = xl.parse(sName)
df
Results:
{'Security exposure - 21 day lag': Percent of Total Holdings \
0 KMNFC vs. 3 Month LIBOR AUD
1 04-OCT-16
2 Australian Dollar
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 Long/Short Net Exposure
9 Total
10 NaN
11 Long
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
(This goes on for 20-30 more rows and 5-6 more columns)
I am using Anaconda, and Spyder, which has a 'Variable Explorer'. It shows the variable df to be a dict containing a DataFrame.
However, I cannot use iloc:
df.iloc[:,1]
Traceback (most recent call last):
File "<ipython-input-77-d7b3e16ccc56>", line 1, in <module>
df.iloc[:,1]
AttributeError: 'dict' object has no attribute 'iloc'
Any thoughts? What am I missing?
EDIT:
To be clear, what I am really trying to do is reference the first column of the df. In R this would be df[,1]. Looking around it seems to be not a very popular way to do things, or not the 'correct' way. I understand why indexing by column names, or keys, is better, but in this situation, I really just need to index the dataframes by column numbers. Any working method of doing that would be greatly appreciated.
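For the positional-indexing part of the question: once df is a real DataFrame rather than a dict, .iloc gives R-style access by column number, for example:

```python
import pandas as pd

# Toy frame; .iloc is the pandas analogue of R's df[, 1]
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

first_col = df.iloc[:, 0]   # first column, as a Series
first_two = df.iloc[:, :2]  # first two columns, as a DataFrame
```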
EDIT (2):
Per a suggestion, I tried 'read_excel', with the same results:
df = pd.ExcelFile(allFiles[i]).parse(sName)
df.loc[1]
Traceback (most recent call last):
File "<ipython-input-90-fc40aa59bd20>", line 2, in <module>
df.loc[1]
AttributeError: 'dict' object has no attribute 'loc'
df = pd.read_excel(allFiles[i], sheetname = sName)
df.loc[1]
Traceback (most recent call last):
File "<ipython-input-91-72b8405c6c42>", line 2, in <module>
df.loc[1]
AttributeError: 'dict' object has no attribute 'loc'
The problem was here:
sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()]
which returned a single element list. I changed it to the following:
sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()][0]
which returns a string, and the code then performs as expected.
All thanks to ayhan for pointing this out.
