ete3: How to get taxonomic rank names from taxonomy id? - python

I want to use ete3 to convert a bunch of identifiers, but I need to know exactly which taxonomic rank is assigned to each taxonomy code. Shown below is an example of a conversion that makes sense, but I don't know what to label some of the taxonomy calls. The basic taxonomic ranks are domain, kingdom, phylum, class, order, family, genus, and species (https://en.wikipedia.org/wiki/Taxonomic_rank).
For most cases it will be easy, but in the case of subspecies and strains for bacteria this can get confusing.
How do I get ete3 to tell me which taxonomic rank each of the lineage IDs corresponds to?
import ete3
import pandas as pd
ncbi = ete3.NCBITaxa()
taxon_id = 505
lineage = ncbi.get_lineage(taxon_id)
Se_lineage = pd.Series(ncbi.get_taxid_translator(lineage), name=taxon_id)
Se_lineage[lineage]
1 root
131567 cellular organisms
2 Bacteria
1224 Proteobacteria
28216 Betaproteobacteria
206351 Neisseriales
481 Neisseriaceae
32257 Kingella
505 Kingella oralis
Name: 505, dtype: object

Use ncbi.get_rank() on the lineage to get a dictionary of {taxid: rank}, then combine it with the {taxid: name} dictionary from get_taxid_translator() to map each rank to its taxon name.
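A minimal sketch building on the code in the question (ete3's get_rank() and get_taxid_translator() both take the list of lineage taxids):

import ete3
import pandas as pd

ncbi = ete3.NCBITaxa()
taxon_id = 505

lineage = ncbi.get_lineage(taxon_id)        # [1, 131567, 2, 1224, ...]
names = ncbi.get_taxid_translator(lineage)  # {taxid: scientific name}
ranks = ncbi.get_rank(lineage)              # {taxid: rank, e.g. 'phylum'}

# Series of names indexed by rank, in lineage order. Note that unranked
# levels such as 'root' and 'cellular organisms' both come back as
# 'no rank', so building a plain dict keyed on rank would silently
# overwrite one of them; an index keeps the duplicates visible.
Se_ranked = pd.Series(
    [names[t] for t in lineage],
    index=[ranks[t] for t in lineage],
    name=taxon_id,
)
print(Se_ranked)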

Related

How to construct a graph from a house price prediction dataset

I have a dataset of house price predictions.
House id  society_id  building_type  households  yyyymmdd    floor  price  date
204       a9cvzgJ     170            185         01/02/2006  3      43000  01/02/2006
100       a4Nkquj     170            150         01/04/2006  13     46300  01/04/2006
The dataset has shape (2000, 40), and 1880 rows share the same house id.
I have to make heterogeneous graphs from the dataset. The metapaths are as follows (BT stands for building type; H1 and H2 represent house 1 and house 2). The meta-graph example and a glimpse of the dataset were given as images.
I know of NetworkX; it has a dataframe-to-graph function, but I don't know how I can use it in my scenario. The price column is the target node.
Any guidance will mean a lot, thank you. The goal is to make an adjacency matrix of the dataset.
To build a graph like M_1 using only one attribute (such as building type), you could do either of the following. You could use from_pandas_edgelist as follows:
G = nx.from_pandas_edgelist(df2, source='house_id', target='building_id')
or you could do the following:
G = nx.Graph()
G.add_edges_from(df.loc[:, ['house_id', 'building_id']].to_numpy())
If you have a list of graphs glist = [M_1, M_2, ...], each of which connects house_id to one other attribute, you can combine them using the compose_all function. For instance,
G = nx.compose_all(glist)
Alternatively, if you have an existing graph made using certain attributes, you can add another attribute with
G.add_edges_from(df.loc[:,['house_id','new_attribute']].to_numpy())
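Putting it together, here is a minimal sketch (the column names are taken from the sample data in the question; networkx is imported as nx):

import networkx as nx
import pandas as pd

# Toy version of the dataset from the question
df = pd.DataFrame({
    'house_id': [204, 100],
    'society_id': ['a9cvzgJ', 'a4Nkquj'],
    'building_type': [170, 170],
})

# One graph per metapath attribute (house -- building type, house -- society)
m1 = nx.from_pandas_edgelist(df, source='house_id', target='building_type')
m2 = nx.from_pandas_edgelist(df, source='house_id', target='society_id')

# Merge the per-attribute graphs into a single heterogeneous graph
G = nx.compose_all([m1, m2])

# Adjacency matrix; rows/columns follow the order of G.nodes()
A = nx.to_numpy_array(G)
print(list(G.nodes()))
print(A)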

How to parse suburb from unstructured address string

Python noob here.
I am working with a large dataset that includes a column with unstructured strings. I need to develop a way to create a list that includes all of the suburb names in Australia (I can source this easily). I then need a program that parses through the string, and where a sequence matches an entry in the list, it saves the substring to a new column. The dataset was appended from multiple sources, so there is no consistent structure to the strings.
As an example, the rows look like this:
GIBSON AVE PADSTOW NSW 2211
SYDNEY ROAD COBURG VIC 3058
DUNLOP ST, ROSELANDS
FOREST RD HURSTVILLE NSW 2220
UNKNOWN
JOSEPHINE CRES CHERRYBROOK NSW 2126
I would be greatly appreciative if anyone has any example code that they can share with me, or if you can point me in the right direction for the most appropriate tool/method to use.
In this example, the expected output would look like:
'Padstow'
'Coburg'
'Roselands'
'Hurstville'
''
'Cherrybrook'
EDIT:
Would this code work?
import pandas as pd
import numpy as np
suburb_list = np.genfromtxt('filepath/nsw.csv', delimiter=',', dtype=str)
top_row = suburb_list[:].tolist()
dataset = pd.read_csv('filepath/dataset.csv')

def get_suburb(dataset.address):
    for s in suburb_list:
        if s in address.lower()
            return s
So for a pretty simple approach, you could just use a big list with all the suburb names in lower case, and then do:
suburbs = ['padstow', 'coburg', ...]  # many more

def get_suburb(unstructured_string):
    for s in suburbs:
        if s in unstructured_string.lower():
            return s
This will give you the first match. If you want to get fancy and maybe try to get it right in the face of misspellings etc., you could try "fuzzy" string comparison methods like the Levenshtein distance (for which you'd have to separate the string into individual words first).
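For instance, a minimal sketch of the fuzzy idea using the standard library's difflib (the cutoff value is an assumption you would tune, and multi-word suburb names would need extra handling):

import difflib

suburbs = ['padstow', 'coburg', 'roselands', 'hurstville', 'cherrybrook']

def get_suburb_fuzzy(unstructured_string, cutoff=0.8):
    # Compare each word of the address against the suburb list and
    # return the best close match, tolerating small misspellings.
    for word in unstructured_string.lower().replace(',', ' ').split():
        matches = difflib.get_close_matches(word, suburbs, n=1, cutoff=cutoff)
        if matches:
            return matches[0]
    return ''

print(get_suburb_fuzzy('SYDNEY ROAD COBURG VIC 3058'))  # 'coburg'
print(get_suburb_fuzzy('DUNLOP ST, ROSELANDS'))         # 'roselands'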

JSON file with different array lengths in Python

I want to explore the population data freely available online at https://www.nomisweb.co.uk/api/v01/dataset/NM_31_1.jsonstat.json . It contains population details of the UK from 1981 to 2017. The code I used so far is below:
import requests
import json
import pandas
json_url = 'https://www.nomisweb.co.uk/api/v01/dataset/NM_31_1.jsonstat.json'
# download the data
j = requests.get(url=json_url)
# load the json
content = json.loads(j.content)
list(content.keys())
The last line of code above gives me the below output:
['version',
'class',
'label',
'source',
'updated',
'value',
'id',
'size',
'role',
'dimension',
'extension']
I then tried to have a look at the lengths of 'value', 'size' and 'role':
print (len(content['value']))
print (len(content['size']))
print (len(content['role']))
And I got the results as below:
22200
5
3
As we can see, the lengths are very different. I cannot convert it into a dataframe as they are all different lengths.
How can I change this to a meaningful format so that I can start exploring it? I am required to do analysis as below:
1. A table showing the male, female and total population in columns, per UK region in rows, as well as the UK total, for the most recent year
2. Exploratory data analysis to show how the population progressed by regions and age groups
You should first read the content of the JSON file except value, because the other fields explain what the value field is: it is a flattened multidimensional matrix whose dimensions are given by content['size'], that is 37x4x3x25x2, and the description of each dimension is given in content['dimension']. The first dimension is time, with 37 years from 1981 to 2017; then geography, with Wales, Scotland, Northern Ireland and England_and_Wales; next comes sex, with Male, Female and Total; then age, with 25 classes. At the very end come the measures, where the first is the total number of persons and the second is its percentage.
Long story short, only content['value'] will be used to feed the dataframe, but you first need to understand how.
But because of the 5 dimensions, it is probably better to first use a NumPy array...
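A minimal sketch of that reshaping, assuming the dimension order described above (37 years x 4 geographies x 3 sexes x 25 age classes x 2 measures); the index positions used at the end are illustrative assumptions:

import numpy as np

# content is the dict loaded from the JSON response in the question
values = np.array(content['value'])     # 22200 flat values
cube = values.reshape(content['size'])  # shape (37, 4, 3, 25, 2)

# Example: total persons (measure 0), both sexes combined (sex index 2),
# summed over the 25 age classes, for the most recent year, per geography.
latest_totals = cube[-1, :, 2, :, 0].sum(axis=-1)
print(latest_totals)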
The data is a complex JSON file and as you stated correctly, you need the data frame columns to be of an equal length. What you mean to say by that, is that you need to understand how the records are stored inside your dataset.
I would advise you to use some JSON Viewer/Prettifier to first research the file and understand its structure.
Only then you would be able to understand which data you need to load to the DataFrame. For example, obviously, there is no need to load the 'version' and 'class' values into the DataFrame as they are not part of any record, but are metadata about the dataset itself.
This is the JSON-stat format. See https://json-stat.org. You can use the Python libraries pyjstat or jsonstat.py to get the data into a pandas dataframe.
You can explore this dataset using the JSON-stat explorer.
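For example, a minimal sketch with pyjstat (assuming its Dataset.read / write('dataframe') interface; not tested against this particular endpoint):

from pyjstat import pyjstat

url = 'https://www.nomisweb.co.uk/api/v01/dataset/NM_31_1.jsonstat.json'

# Download the JSON-stat dataset and convert it to a tidy dataframe:
# one row per combination of time, geography, sex, age and measure.
dataset = pyjstat.Dataset.read(url)
df = dataset.write('dataframe')
print(df.head())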

How to keep track of different types of missing values in pandas?

Summary
It's important in many scientific applications to keep track of different kinds of missing value. Is a value for 'weekly income from main job' missing because the person doesn't have a job, or because they have a job but refused to answer?
Storing all missing values as NA or NaN loses this information.
Storing missing value labels (e.g., 'missing because no job', 'missing because refused to answer') in a separate column means the researcher must keep track of two columns for every operation she performs – such as groupby, renaming, and so on. This creates endless opportunities for mistakes and errors.
Storing missing value labels within the same column (e.g., as negative numbers, as in the example below, or very large numbers like 99999) means the researcher must manually keep track of how missing value labels are encoded for every column, and creates many other opportunities for mistakes (e.g., forgetting that a column includes missing values and taking a raw mean instead of using the correct mask).
It is very easy to handle this problem in Stata (see below), by using a data type that stores both numeric values and missing value labels, and with functions that know how to handle this data type. This is highly performant (data type remains numeric, not string or mixed – think of NumPy's data types, except instead of having just NaN we have NaN1, NaN2, etc.) What is the best way of achieving something like this in pandas?
Note: I'm an economist, but this is also an incredibly common workflow for political scientists, epidemiologists, etc. – anyone who deals with survey data. In this context, the analyst knows what the missing values are via a codebook, really cares about keeping track of them, and has hundreds or thousands of columns to deal with – so, indeed, needs an automated way of keeping track of them.
Motivation/context
It's extremely common when dealing with any kind of survey data to have multiple kinds of missing data.
Here is a minimal example from a government questionnaire used to produce official employment statistics:
[Q1] Do you have a job?
[Q2] [If Q1=Yes] What is your weekly income from that job?
The above occurs in pretty much every government-run labor force survey in the world (e.g., the UK Labour Force Survey, the US Current Population Survey, etc.).
Now, for a given respondent, if [Q2] is missing, it could be that (1) they answered No to [Q1], and so were ineligible to be asked [Q2], or that (2) they answered Yes to [Q1] but refused to answer [Q2] (perhaps because they were embarrassed at how much/little they earn, or because they didn't know).
As a researcher, it matters a great deal to me whether it was (1) that occurred, or whether it was (2). Suppose my job is to report the average weekly income of workers in the United States. If there are many missing values for this [Q2] column, but they are all labeled 'missing because respondent answered no to [Q1]', then I can take the average of [Q2] with confidence – it is, indeed, the average weekly income of people in work. (All the missing values are people who didn't have a job.)
On the other hand, if those [Q2] missing values are all labeled 'missing because respondent was asked this question but refused to answer', then I cannot simply report the average of [Q2] as the average weekly income of workers. I'll need to issue caveats around my results. I'll need to analyze the kinds of people who don't answer (are they missing at random, or are people in higher-income occupations more likely to refuse, for example, biasing my results?). Possibly I'll try to impute missing values, and so on.
The problem
Because these 'reasons for being missing' are so important, government statistical agencies will code the different reasons within the column:
So the column containing the answers to [Q2] above might contain the values [1500, -8, 10000, -2, 3000, -1, 6400].
In this case, '1500', '10000', and so on are 'true' answers to [Q2] ($1,500 weekly income, $10,000 weekly income, etc.); whereas '-8' means they weren't eligible to answer (because they answered No to [Q1]), '-2' means they were eligible to answer but refused to do so, and so on.
Now, obviously, if I take the average of this column, I'm going to get something meaningless.
On the other hand, if I just replace all negative values with NaN, then I can take the average – but I've lost all this valuable information about why values are missing. For example, I may want to have a function that takes any column and reports, for that column, statistics like the mean and median, the number of eligible observations (i.e., everything except value=-8), and the percent of those that were non-missing.
It works great in Stata
Doing this in Stata is extremely easy. Stata has 27 numeric missing values: the system missing '.' plus '.a' through '.z'. (More details here.) I can write:
replace weekly_income = .a if weekly_income == -1
replace weekly_income = .b if weekly_income == -8
and so on.
Then (in pseudocode) I can write
stats weekly_income if weekly_income!=.b
When reporting the mean, Stata will automatically ignore the values coded as missing (indeed, they're now not numeric); but it will also give me missing value statistics only for the observations I care about (in this case, those eligible to be asked the question, i.e., those who weren't originally coded '-8').
What is the best way to handle this in Pandas?
Setup:
>>> import pandas as pd
>>> df = pd.DataFrame.from_dict({
'income': [1500, -8, 10000, -2, 3000, -1, 6400]})
Desired outcome:
>>> df.income.missing_dict = {'-1': ['.a', 'Don\'t know'], '-2': ['.b', 'Refused']} # etc.
>>> df
income
0 1500
1 Inapplic.
2 10000
3 Refused
4 3000
5 Don't know
6 6400
>>> assert df.income.mean() == np.mean([1500, 10000, 3000, 6400])
(passes)
The 'obvious' workaround
Clearly, one option is to split every column into two columns: one numeric column with non-missing values and NaNs, and the other a categorical column with categories for the different types of missing value.
But this is extremely inconvenient. These surveys often have thousands of columns, and a researcher might well use hundreds in certain kinds of economic analysis. Having two columns for every 'underlying' column means the researcher has to keep track of two columns for every operation she performs – such as groupby, renaming, and so on. This creates endless opportunities for mistakes and errors. It also means that displaying the table is very wasteful – for any column, I need to now display two columns, one of which for any given observation is always redundant. (This is wasteful both of screen real estate, and of the human analysts' attention, having to identify which two columns are a 'pair'.)
Other ideas
Two other thoughts that occur to me, both probably non-ideal:
(1) Create a new data type in pandas that works similarly to Stata (i.e., adds '.a', '.b', etc. to allowable values for numeric columns).
(2) Use the two-columns solution above, but (re)write 'wrapper' functions in pandas so that 'groupby' etc. keeps track of the pairs of columns for me.
I suspect that (1) is the best solution for the long term, but it would presumably require a huge amount of development.
On the other hand, maybe there are already packages that solve this? Or people have better work-arounds?
To show the solution, I'm taking the liberty of changing the missing_dict keys to match the data type of income.
>>> df
income
0 1500
1 -8
2 10000
3 -2
4 3000
5 -1
6 6400
>>> df.income.missing_dict
{-8: ['.c', 'Stifled by companion'], -2: ['.b', 'Refused'], -1: ['.a', "Don't know"]}
Now, here's how to filter the rows according to the values being in the "missing" list:
>>> df[(~df.income.isin((df.income.missing_dict)))]
income
0 1500
2 10000
4 3000
6 6400
Note that isin() accepts any iterable; iterating missing_dict yields its keys (-8, -2, -1), so the extra inner parentheses are not strictly required. The tilde operator, bit-wise negation, then inverts the result to give a Boolean Series selecting the non-missing rows.
Finally, apply mean to the resulting data column:
>>> df[(~df.income.isin((df.income.missing_dict)))].mean()
income 5225.0
dtype: float64
Does that point you in the right direction? From here, you can simply replace income with the appropriate column or variable name as needed.
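To get closer to the Stata-style "stats ... if" workflow from the question, a small hypothetical helper (the name summarize and the exact statistics reported are my own choices) could bundle the mask and the summary:

import pandas as pd

def summarize(series, missing_dict, inapplicable_code=-8):
    # Rows that were eligible to be asked (everything except 'inapplicable').
    eligible = series[series != inapplicable_code]
    # Rows with a genuine numeric answer (not any of the sentinel codes).
    answered = eligible[~eligible.isin(list(missing_dict))]
    return pd.Series({
        'mean': answered.mean(),
        'median': answered.median(),
        'eligible': len(eligible),
        'pct_answered': 100 * len(answered) / len(eligible),
    })

missing_dict = {-8: ['.c', 'Inapplicable'], -2: ['.b', 'Refused'], -1: ['.a', "Don't know"]}
df = pd.DataFrame({'income': [1500, -8, 10000, -2, 3000, -1, 6400]})
print(summarize(df.income, missing_dict))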
Pandas recently introduced a custom array type called ExtensionArray that allows defining what is in essence a custom column type, allowing you to (sort of) use actual values alongside missing data without dealing with two columns. Here is a very, very crude implementation, which has barely been tested:
import numpy as np
import pandas as pd
from pandas.core.arrays.base import ExtensionArray


class StataData(ExtensionArray):
    # data holds the numeric values, missing holds a per-element reason code
    # (0 means "not missing"), and factors is used by factorization.
    def __init__(
        self, data, missing=None, factors=None, dtype=None, copy=False
    ):
        def own(array, dtype=dtype):
            array = np.asarray(array, dtype)
            if copy:
                array = array.copy()
            return array

        self.data = own(data)
        if missing is None:
            missing = np.zeros_like(data, dtype=int)
        else:
            missing = own(missing, dtype=int)
        self.missing = missing
        self.factors = own(factors)

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        return cls(scalars, dtype=dtype, copy=copy)

    @classmethod
    def _from_factorized(cls, data, original):
        return cls(original, None, data)

    def __getitem__(self, key):
        return type(self)(
            self.data[key], self.missing[key], self.factors
        )

    def __setitem__(self, key, value):
        self.data[key] = value
        self.missing[key] = 0

    def __len__(self):
        return len(self.data)

    def __iter__(self):
        return iter(self.data)

    @property
    def dtype(self):
        return self.data.dtype

    @property
    def shape(self):
        return self.data.shape

    @property
    def nbytes(self):
        return self.data.nbytes + self.missing.nbytes + self.factors.nbytes

    def view(self):
        return self

    @property
    def reason_missing(self):
        # Expose the raw reason codes (e.g. to distinguish "refused"
        # from "not applicable").
        return self.missing

    def isna(self):
        return self.missing != 0

    def __repr__(self):
        s = {}
        for attr in ['data', 'missing', 'factors']:
            s[attr] = getattr(self, attr)
        return repr(s)
With this implementation, you can do the following:
>>> a = StataData([1, 2, 3, 4], [0, 0, 1, 0])
>>> s = pd.Series(a)
>>> print(s[s.isna()])
2 3
dtype: int32
>>> print(s[~s.isna()])
0 1
1 2
3 4
dtype: int32
>>> print(s.isna().values.reason_missing)
array([1])
Hopefully someone who understands this API can chime in and help improve this. For starters, a cannot be used in DataFrames, only Series.
>>> print(pd.DataFrame({'a': s}).isna())
0 False
1 False
2 False
3 False

Adding values to pandas dataframe with function based on other column in dataframe

This sounds similar to a lot of SO questions but I haven't actually found it; if it's here, please feel free to link and I'll delete.
I have two dataframes. The first looks like this:
owned category weight mechanics_split
28156 Environmental, Medical 2.8023 [Action Point Allowance System, Co-operative P...
9269 Card Game, Civilization, Economic 4.3073 [Action Point Allowance System, Auction/Biddin...
36707 Modern Warfare, Political, Wargame 3.5293 [Area Control / Area Influence, Campaign / Bat...
The second looks like this:
type amount owned
0 Action Point Allowance System 378 0
1 Co-operative Play 302 0
2 Hand Management 1308 0
3 Point to Point Movement 278 0
4 Set Collection 708 0
5 Trading 142 0
What I'm trying to do is iterate over each word in mechanics_split so that the owned value in the first dataframe is added to the owned column in the second dataframe. For example, if Dice Rolling is in the first row of games in the mechanics_split column, the owned amount for that whole row is added to games_owned['owned'], and so on, for each value in the list in mechanics_split through the whole dataframe.
So far, I've tried:
owned_dict = {}

def total_owned(x):
    for e in x:
        if e not in owned_dict:
            owned_dict[e] = 0
        if e in owned_dict:
            owned_dict[e] += games['owned'][x]
    return owned_dict
which returned:
KeyError: "None of [['Action Point Allowance System', 'Co-operative Play', 'Hand Management', 'Point to Point Movement', 'Set Collection', 'Trading', 'Variable Player Powers']] are in the [index]"
If I add another letter before e, I'm told there are too many values to unpack. I also tried skipping the dictionary and just using otherdf['owned'][e] += games['owned'][x] to no avail.
I may be fundamentally misunderstanding something about how indexes work in pandas and how to index a value to a row, so if I am, please let me know. Thanks very much for any help.
EDIT: I've solved part of the problem by changing the index of the second dataframe to the 'types' column with otherdf.index = otherdf.types, but I'm still left with the problem of transferring the owned values from the first dataframe.
I agree with you that using the 'type' column as a label-based index will make things easier. With this done, you can iterate over the rows of the first dataframe, then add owned value to the appropriate row in the second dataframe using the .loc method.
for row_1 in df_1.iterrows():
    owned_value = row_1[1]['owned']  # iterrows() yields (index, row) pairs
    mechanics = row_1[1]['mechanics_split']
    for type_string in mechanics:
        df_2.loc[type_string, 'owned'] += owned_value
In addition, I suggest reading on how Pandas handles indexing to help avoid any 'gotchas' as you continue to work with Python.
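As an alternative to the explicit loop, here is a hedged sketch of a vectorized version using DataFrame.explode (available in pandas 0.25+); the names df_1 and df_2 follow the loop above, with df_2 indexed by its 'type' column:

# Expand each list in mechanics_split into its own row, then total
# the 'owned' counts per mechanic in one pass.
owned_per_type = (
    df_1.explode('mechanics_split')
        .groupby('mechanics_split')['owned']
        .sum()
)

# Add the totals into the second dataframe, aligning on its index.
df_2['owned'] = df_2['owned'].add(owned_per_type, fill_value=0)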
