Find "seasonality" in a categorical time series in python - python

I have the following sequence:
states_list = ['H', 'M', 'M', 'M', 'H', 'H', 'H', 'H', 'C', 'C', 'H', 'H', 'C', 'C', 'H', 'A', 'A', 'A', 'A', 'A', 'S', 'S', 'S', 'A', 'S', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'C', 'H', 'H', 'H', 'H', 'H', 'S', 'H', 'S', 'S', 'S', 'H', 'H', 'H', 'H', 'H', 'H', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'H', 'H', 'H', 'H', 'H', 'C', 'C', 'C', 'A', 'C', 'C', 'A', 'A', 'A', 'A', 'A', 'H', 'H', 'H', 'H', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C']
Is there a way to find "seasonality" in this time series?
By "seasonality" I mean a specific sub-sequence of letters that recurs every "n" letters.

The standard technique for seasonality detection is the lagged autocorrelation plot: you shift your series by various time lags and check whether the shifted series is correlated with the original (search for ACF and ACF plot).
Your time series is categorical, so the standard tools won't work out of the box. I searched briefly and didn't find anything ready-made, but all the ingredients are there.
The main one is a correlation measure for categorical variables, namely Cramér's V. See for example https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.contingency.association.html.
Then you need to write some code that, for each k = 1, 2, 3, ..., shifts the series by k, computes Cramér's V between the shifted and unshifted series, and saves the result.
After that, plot k against the computed values and see whether any lags stand out.
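The loop described above could look roughly like the sketch below. It computes the plain (uncorrected) Cramér's V by hand with the standard library instead of calling the SciPy function linked above, so the helper names cramers_v and lagged_cramers_v are made up for this illustration, and returning 0.0 when V is undefined is an assumption.

```python
from collections import Counter
from math import sqrt

def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equal-length categorical sequences."""
    n = len(xs)
    pair_counts = Counter(zip(xs, ys))   # contingency table cells
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    dof = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or dof == 0:
        return 0.0  # assumption: report 0 when V is undefined
    chi2 = 0.0
    for r, r_tot in row_totals.items():
        for c, c_tot in col_totals.items():
            expected = r_tot * c_tot / n
            # Counter returns 0 for category pairs that never occur
            chi2 += (pair_counts[(r, c)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * dof))

def lagged_cramers_v(series, max_lag):
    """Map each lag k (1..max_lag) to V between the series and its k-shifted copy."""
    return {k: cramers_v(series[:-k], series[k:]) for k in range(1, max_lag + 1)}

# Toy series with an exact period of 3, used only to exercise the functions
demo = list("ABC" * 10)
print(lagged_cramers_v(demo, 3))
```

For the question's data you would call lagged_cramers_v(states_list, max_lag) with some max_lag well below the series length and plot the resulting dictionary. With only 100 observations and five categories, V tends to be inflated, so treat isolated peaks with caution.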

Related

Python - Creating permutations with output array index constraints

I want to create all possible permutations for an array in which each element can only occur once, with constraints on the element array index position.
ID = ["A","B","C","D","E","F","G","H","I","J"]
I want to create all possible permutations of ID; however, the position of each element is restricted to the index positions given by:
ID = ["A","B","C","D","E","F","G","H","I","J"]
Index_Options = []
for i in range(len(ID)):
    List1 = []
    distance = 3
    value = i - distance
    for j in range(int(distance) * 2):
        if value < 0 or value > len(ID):
            print("Disregard")  # Outside acceptable distance range
        else:
            List1.append(value)
        value = value + 1
    Index_Options.append(List1)
print(Index_Options)
# Index_Options gives the possible index positions for each element, i.e. "A" can occur only in index positions 0,1,2, "B" can occur only in index positions 0,1,2,3, etc.
I'm just struggling with how to use this information to create all the output permutations.
Any help would be appreciated.
You can use a recursive generator function to build the permutations. Instead of generating every permutation of ID and then filtering against Index_Options, it is much more efficient to traverse Index_Options directly, picking one index that has not been used yet at each step:
ID = ["A","B","C","D","E","F","G","H","I","J"]
def combos(d, c = [], s = []):
    if not d:
        yield c
    else:
        for i in filter(lambda x: x not in s and x < len(ID), d[0]):
            yield from combos(d[1:], c=c+[ID[i]], s=s+[i])
print(list(combos(Index_Options)))
Output (first ten combinations produced):
[['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'I'], ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'I', 'H', 'J'], ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'I', 'J', 'H'], ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'J', 'H', 'I'], ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'J', 'I', 'H'], ['A', 'B', 'C', 'D', 'E', 'F', 'H', 'G', 'I', 'J'], ['A', 'B', 'C', 'D', 'E', 'F', 'H', 'G', 'J', 'I'], ['A', 'B', 'C', 'D', 'E', 'F', 'H', 'I', 'G', 'J'], ['A', 'B', 'C', 'D', 'E', 'F', 'H', 'I', 'J', 'G']]
You can use itertools.permutations to create all the possible permutations and then build a new list that keeps only those in which every letter sits at an allowed position. Note that for 10 elements this checks all 3,628,800 permutations:
import itertools
permutations = [p for p in itertools.permutations(ID, len(ID)) if all(i in Index_Options[ID.index(x)] for i, x in enumerate(p))]
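As a cross-check of the two approaches, here is a minimal backtracking sketch on a smaller list. The helper names allowed_positions and constrained_permutations are made up for this example, and it assumes a symmetric window (positions i - distance through i + distance), which differs slightly from the window built in the question.

```python
from itertools import permutations

def allowed_positions(n, distance):
    """For element index i, the positions within `distance` of i (clipped to 0..n-1)."""
    return [[p for p in range(i - distance, i + distance + 1) if 0 <= p < n]
            for i in range(n)]

def constrained_permutations(items, options):
    """Yield permutations in which items[k] lands only on a position in options[k]."""
    n = len(items)
    result = [None] * n
    used = [False] * n

    def place(k):
        if k == n:
            yield list(result)  # copy, because result is reused while backtracking
            return
        for pos in options[k]:
            if not used[pos]:
                used[pos] = True
                result[pos] = items[k]
                yield from place(k + 1)
                used[pos] = False

    yield from place(0)

items = ["A", "B", "C", "D"]
options = allowed_positions(len(items), 1)
fast = sorted(constrained_permutations(items, options))
# Brute force: generate everything, then filter by position
brute = sorted(list(p) for p in permutations(items)
               if all(i in options[items.index(x)] for i, x in enumerate(p)))
print(fast == brute, len(fast))
```

The backtracking version abandons a branch as soon as an element has no free allowed position, which is why it scales better than filtering every permutation.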

How to split/slice a list in Python

I saw this code in w3resource:
C = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n']
def list_slice(S, step):
    return [S[i::step] for i in range(step)]
print(list_slice(C,3))
Output :[['a', 'd', 'g', 'j', 'm'], ['b', 'e', 'h', 'k', 'n'], ['c', 'f', 'i', 'l']]
I tried it without list comprehension and a function:
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n']
step = int(input("Step of every element: "))
for i in range(step):
    print(letters[i::step])
Output:
['a', 'd', 'g', 'j', 'm']
['b', 'e', 'h', 'k', 'n']
['c', 'f', 'i', 'l']
Is it possible to make my output look like this: [['a', 'd', 'g', 'j', 'm'], ['b', 'e', 'h', 'k', 'n'], ['c', 'f', 'i', 'l']], without using a list comprehension and without creating another variable holding an empty list?
This is not a list comprehension solution. Since you want elements that are a fixed step apart to end up in the same list, note that their indices all share one property: they have the same remainder when divided by the step value. Using this, you can collect the values for each remainder in a dict keyed by that remainder, and then return the dict values that are non-empty.
C = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n']
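def func(lis, slic):
    # One bucket per possible remainder 0..slic-1
    res = {i: [] for i in range(slic)}
    for i in range(len(lis)):
        res[i % slic].append(lis[i])
    # Drop buckets that stayed empty (possible when slic > len(lis))
    return [i for i in res.values() if i != []]
print(func(C, 3))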
# Output :[['a', 'd', 'g', 'j', 'm'], ['b', 'e', 'h', 'k', 'n'], ['c', 'f', 'i', 'l']]
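If the goal is strictly to avoid both a list comprehension and a separately initialised empty list, one option is map with a lambda. This is only a sketch; the function name list_slice is reused from the w3resource snippet in the question.

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n']

def list_slice(seq, step):
    # map applies the slice for each start offset 0..step-1;
    # list() collects the results into a list of lists
    return list(map(lambda i: seq[i::step], range(step)))

sliced = list_slice(letters, 3)
print(sliced)
```

Each start offset i yields the slice seq[i::step], which is the same grouping by remainder that the dict-based answer builds by hand.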

Split every character from string in list

I have a list and I need to split every string into individual characters.
mylist = ['TCTAGTCCAGATAATCTGGT', 'GTGTTGGTACTGTAATGAAA', 'AGTTCTCTGGATCCTTCGGA', 'GGAATTGACGTCCCCAGGAA', 'GTCGTTGTCGTTCAGGAGTT', 'GGAGTCCGTCAGAAGAGGTC', 'GATTCCGATCAGATGAAGAA', 'CTTTCTATCGGGAAGAGGAG', 'ATGTCTTGAGATCGGGTCGT', 'ATTAAGATCCTCCATGATTC', 'ATCGTCGAAAGTAGTGGGAA']
And I need
output = ['T', 'C', 'T', ... 'A', 'A']
I've tried so many ways and can't figure it out.
You can just use a nested list comprehension for this.
mylist = ['TCTAGTCCAGATAATCTGGT', 'GTGTTGGTACTGTAATGAAA', 'AGTTCTCTGGATCCTTCGGA', 'GGAATTGACGTCCCCAGGAA', 'GTCGTTGTCGTTCAGGAGTT', 'GGAGTCCGTCAGAAGAGGTC', 'GATTCCGATCAGATGAAGAA', 'CTTTCTATCGGGAAGAGGAG', 'ATGTCTTGAGATCGGGTCGT', 'ATTAAGATCCTCCATGATTC', 'ATCGTCGAAAGTAGTGGGAA']
chars = [c for s in mylist for c in s]
print(chars)
# ['T', 'C', 'T', 'A', 'G', 'T', 'C', 'C', 'A', 'G', 'A', 'T', 'A', 'A', 'T', 'C', 'T', 'G', 'G', 'T', 'G', 'T', 'G', 'T', 'T', 'G', 'G', 'T', 'A', 'C', 'T', 'G', 'T', 'A', 'A', 'T', 'G', 'A', 'A', 'A', 'A', 'G', 'T', 'T', 'C', 'T', 'C', 'T', 'G', 'G', 'A', 'T', 'C', 'C', 'T', 'T', 'C', 'G', 'G', 'A', 'G', 'G', 'A', 'A', 'T', 'T', 'G', 'A', 'C', 'G', 'T', 'C', 'C', 'C', 'C', 'A', 'G', 'G', 'A', 'A', 'G', 'T', 'C', 'G', 'T', 'T', 'G', 'T', 'C', 'G', 'T', 'T', 'C', 'A', 'G', 'G', 'A', 'G', 'T', 'T', 'G', 'G', 'A', 'G', 'T', 'C', 'C', 'G', 'T', 'C', 'A', 'G', 'A', 'A', 'G', 'A', 'G', 'G', 'T', 'C', 'G', 'A', 'T', 'T', 'C', 'C', 'G', 'A', 'T', 'C', 'A', 'G', 'A', 'T', 'G', 'A', 'A', 'G', 'A', 'A', 'C', 'T', 'T', 'T', 'C', 'T', 'A', 'T', 'C', 'G', 'G', 'G', 'A', 'A', 'G', 'A', 'G', 'G', 'A', 'G', 'A', 'T', 'G', 'T', 'C', 'T', 'T', 'G', 'A', 'G', 'A', 'T', 'C', 'G', 'G', 'G', 'T', 'C', 'G', 'T', 'A', 'T', 'T', 'A', 'A', 'G', 'A', 'T', 'C', 'C', 'T', 'C', 'C', 'A', 'T', 'G', 'A', 'T', 'T', 'C', 'A', 'T', 'C', 'G', 'T', 'C', 'G', 'A', 'A', 'A', 'G', 'T', 'A', 'G', 'T', 'G', 'G', 'G', 'A', 'A']
You can use a list comprehension to create new sublists in which each string is split into its characters.
mylist = ['TCTAGTCCAGATAATCTGGT', 'GTGTTGGTACTGTAATGAAA', 'AGTTCTCTGGATCCTTCGGA', 'GGAATTGACGTCCCCAGGAA', 'GTCGTTGTCGTTCAGGAGTT', 'GGAGTCCGTCAGAAGAGGTC', 'GATTCCGATCAGATGAAGAA', 'CTTTCTATCGGGAAGAGGAG', 'ATGTCTTGAGATCGGGTCGT', 'ATTAAGATCCTCCATGATTC', 'ATCGTCGAAAGTAGTGGGAA']
my_split_list = [[char for char in element] for element in mylist]
print(mylist)
print(my_split_list)
OUTPUT
['TCTAGTCCAGATAATCTGGT', 'GTGTTGGTACTGTAATGAAA', 'AGTTCTCTGGATCCTTCGGA', 'GGAATTGACGTCCCCAGGAA', 'GTCGTTGTCGTTCAGGAGTT', 'GGAGTCCGTCAGAAGAGGTC', 'GATTCCGATCAGATGAAGAA', 'CTTTCTATCGGGAAGAGGAG', 'ATGTCTTGAGATCGGGTCGT', 'ATTAAGATCCTCCATGATTC', 'ATCGTCGAAAGTAGTGGGAA']
[['T', 'C', 'T', 'A', 'G', 'T', 'C', 'C', 'A', 'G', 'A', 'T', 'A', 'A', 'T', 'C', 'T', 'G', 'G', 'T'], ['G', 'T', 'G', 'T', 'T', 'G', 'G', 'T', 'A', 'C', 'T', 'G', 'T', 'A', 'A', 'T', 'G', 'A', 'A', 'A'], ['A', 'G', 'T', 'T', 'C', 'T', 'C', 'T', 'G', 'G', 'A', 'T', 'C', 'C', 'T', 'T', 'C', 'G', 'G', 'A'], ['G', 'G', 'A', 'A', 'T', 'T', 'G', 'A', 'C', 'G', 'T', 'C', 'C', 'C', 'C', 'A', 'G', 'G', 'A', 'A'], ['G', 'T', 'C', 'G', 'T', 'T', 'G', 'T', 'C', 'G', 'T', 'T', 'C', 'A', 'G', 'G', 'A', 'G', 'T', 'T'], ['G', 'G', 'A', 'G', 'T', 'C', 'C', 'G', 'T', 'C', 'A', 'G', 'A', 'A', 'G', 'A', 'G', 'G', 'T', 'C'], ['G', 'A', 'T', 'T', 'C', 'C', 'G', 'A', 'T', 'C', 'A', 'G', 'A', 'T', 'G', 'A', 'A', 'G', 'A', 'A'], ['C', 'T', 'T', 'T', 'C', 'T', 'A', 'T', 'C', 'G', 'G', 'G', 'A', 'A', 'G', 'A', 'G', 'G', 'A', 'G'], ['A', 'T', 'G', 'T', 'C', 'T', 'T', 'G', 'A', 'G', 'A', 'T', 'C', 'G', 'G', 'G', 'T', 'C', 'G', 'T'], ['A', 'T', 'T', 'A', 'A', 'G', 'A', 'T', 'C', 'C', 'T', 'C', 'C', 'A', 'T', 'G', 'A', 'T', 'T', 'C'], ['A', 'T', 'C', 'G', 'T', 'C', 'G', 'A', 'A', 'A', 'G', 'T', 'A', 'G', 'T', 'G', 'G', 'G', 'A', 'A']]
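For the flat output the question asks for, itertools.chain.from_iterable is another option, since iterating over a string yields its characters one by one. A minimal sketch on a shortened, made-up input:

```python
from itertools import chain

# Shortened sample; the real list from the question works the same way
sample = ['TCT', 'GTG', 'AG']

# chain.from_iterable walks each string in turn and yields its characters
chars = list(chain.from_iterable(sample))
print(chars)  # ['T', 'C', 'T', 'G', 'T', 'G', 'A', 'G']
```

This produces the same result as the nested comprehension [c for s in sample for c in s].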

Python group a range of percentage with increment and count the number of group

I have 3 groups of lists which are A1, A2, A3 as in group A, B1, B2, B3 in group B,
and C1, C2, C3 in group C.
a1 = ["ID_A1", ['T', 'T', 'C', 'C', 'A', 'C', 'A', 'G', 'C', 'T', 'T', 'T', 'T', 'C', 'G', 'C', 'C', 'A', 'A', 'G', 'C', 'T', 'C']]
a2 = ["ID_A2", ['T', 'T', 'C', 'C', 'A', 'C', 'A', 'G', 'C', 'T', 'T', 'T', 'T', 'C', 'G', 'C', 'C', 'A', 'A', 'G', 'C', 'T', 'T']]
a3 = ["ID_A3", ['T', 'T', 'C', 'C', 'A', 'C', 'A', 'G', 'C', 'T', 'T', 'T', 'T', 'C', 'G', 'C', 'C', 'A', 'A', 'G', 'C', 'T', 'G']]
b1 = ["ID_B1", ['C', 'T', 'C', 'C', 'A', 'C', 'C', 'A', 'C', 'T', 'T', 'T', 'C', 'C', 'A', 'C', 'C', 'A', 'A', 'A', 'C', 'T', 'C']]
b2 = ["ID_B2", ['C', 'T', 'C', 'C', 'A', 'C', 'C', 'A', 'C', 'T', 'T', 'T', 'C', 'C', 'A', 'C', 'C', 'A', 'A', 'A', 'C', 'A', 'C']]
b3 = ["ID_B3", ['C', 'T', 'C', 'C', 'A', 'C', 'C', 'A', 'C', 'T', 'T', 'T', 'C', 'C', 'A', 'C', 'C', 'A', 'A', 'A', 'C', 'G', 'C']]
c1 = ["ID_C1", ['T', 'T', 'C', 'C', 'A', 'C', 'A', 'A', 'C', 'T', 'T', 'T', 'T', 'C', 'G', 'C', 'C', 'A', 'A', 'G', 'C', 'T', 'T']]
c2 = ["ID_C2", ['T', 'T', 'C', 'C', 'A', 'C', 'A', 'G', 'C', 'T', 'T', 'T', 'T', 'C', 'G', 'C', 'C', 'A', 'A', 'G', 'C', 'T', 'T']]
c3 = ["ID_C3", ['T', 'T', 'C', 'C', 'A', 'C', 'A', 'G', 'C', 'T', 'T', 'T', 'T', 'C', 'G', 'C', 'C', 'A', 'A', 'G', 'C', 'T', 'G']]
data_set = [a1, a2, a3, b1, b2, b3, c1, c2, c3]
I have already compared their similarities with the codes below:
def compare(_from, _to):
    similarity = 0
    length = len(_from)
    if len(_from) != len(_to):
        raise Exception("Cannot be compared due to different length.")
    for i in range(length):
        if _from[i] == _to[i]:
            similarity += 1
    return similarity / length * 100
result = list()
for entry1 in data_set:
    for entry2 in data_set:
        percentage = compare(entry1[1], entry2[1])
        print("Compare ", entry1[0], " to ", entry2[0], "Percentage :", round(percentage, 2))
        result.append(round(percentage, 2))
print(result)
Instead of grouping the similarities by their exact values, I want to group them within a range, for example 95% to 96%, in increments of 0.1; the range depends on what the user inputs. I want a 0.1 increment because my real data set is very large and I can't include it here. When I loop over group A (comparing it against ID_A1 to ID_C3), every similarity between 95% and 96% should go into group A, making the number of groups 1; when I loop over group B (comparing it against ID_A1 to ID_C3), every similarity between 95% and 96% should go into group B, and the number of groups increases by 1. The result I want is the total number of groups in the range 95% to 96%.
I would also like to know: within the range 95.0% to 96.0%, if there are values of 95.5% and 95.6%, how can I group them as individual groups?
The example output would be like:
"In the range of 95% to 96%, there is 1 group of 95.5% and 1 group of 95.6%"
"Total number of groups: ... "
PS: I need to use the number of groups to plot a graph
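One possible reading of this requirement is to keep only the similarities inside the requested range and bucket them by their value rounded to one decimal place. The sketch below follows that interpretation only; the function name count_groups and the sample values are made up, and bucketing by rounding (rather than by flooring to the lower 0.1 edge) is an assumption.

```python
from collections import Counter

def count_groups(similarities, low, high, decimals=1):
    """Count similarities within [low, high], bucketed by value rounded to `decimals`."""
    in_range = [s for s in similarities if low <= s <= high]
    return Counter(round(s, decimals) for s in in_range)

# Made-up percentages standing in for the `result` list built above
sample = [95.52, 95.58, 95.61, 97.0, 100.0]
groups = count_groups(sample, 95.0, 96.0)
for value, count in sorted(groups.items()):
    print(f"{count} value(s) in the {value}% group")
print("Total number of groups:", len(groups))
```

len(groups) gives the number of distinct groups in the range, which is the figure that could then be plotted.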

NetworkX shuffles nodes order

I'm a beginner at programming and I'm new here, so hello!
I'm having a problem with node order in NetworkX.
This code:
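import networkx as nx
from string import ascii_lowercase
# nodesNum is defined elsewhere (10 in the output shown below)
letters = []
G = nx.Graph()
for i in range(nodesNum):
    letter = ascii_lowercase[i]
    letters.append(letter)
    print letters
G.add_nodes_from(letters)
print "G.nodes = ", (G.nodes())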
returns this:
['a']
['a', 'b']
['a', 'b', 'c']
['a', 'b', 'c', 'd']
['a', 'b', 'c', 'd', 'e']
['a', 'b', 'c', 'd', 'e', 'f']
['a', 'b', 'c', 'd', 'e', 'f', 'g']
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
G.nodes = ['a', 'c', 'b', 'e', 'd', 'g', 'f', 'i', 'h', 'j']
While I would like to have it in normal (alphabetical) order.
Could anyone tell me what I am doing wrong?
The order is important to me, as later I'm asking user to tell me where the edges are.
Thanks in advance!
You can sort the nodes on output like this
print "G.nodes = ", sorted(G.nodes())
or similarly you can sort the edges like
print "G.edges = ", sorted(G.edges())
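Aric's solution is fine if you only need the order for printing. However, if you are going to use the adjacency matrix for calculations and you want consistent matrices across different runs, you should use an ordered graph. (Note that nx.OrderedGraph was removed in NetworkX 3.0; in recent versions a plain nx.Graph already keeps nodes in insertion order.)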
letters = []
G = nx.OrderedGraph()
for i in range(10):
    letter = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'][i]
    letters.append(letter)
    print (letters)
G.add_nodes_from(letters)
print ("G.nodes = ", (G.nodes()))
which returns
['a']
['a', 'b']
['a', 'b', 'c']
['a', 'b', 'c', 'd']
['a', 'b', 'c', 'd', 'e']
['a', 'b', 'c', 'd', 'e', 'f']
['a', 'b', 'c', 'd', 'e', 'f', 'g']
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
G.nodes = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
