I want to plot a transition between multiple groups in python. Say I have three groups A, B and C at a given datetime x. Now at datetime y > x I want to visualize what % of the elements of A transitioned into group B and what % into C, and similarly for B and C. For now you can assume there is a fixed number of elements. Can I also extend this to multiple dates, like x < y < z, and visualize the changes?
A sample dataframe of my usecase can be generated using this code
import numpy as np
import pandas as pd

elements = [f'e{i}' for i in range(10)]
x = pd.DataFrame({'element': elements, 'group': np.random.choice(['A', 'B', 'C'], size=10), 'date': pd.to_datetime('2021-04-01')})
y = pd.DataFrame({'element': elements, 'group': np.random.choice(['A', 'B', 'C'], size=10), 'date': pd.to_datetime('2021-04-10')})
df = pd.concat([x, y])  # DataFrame.append is deprecated
Now, from the above dataframe, I want to visualize how the transitions between groups A, B and C happened across the two dates.
My main issue is I don't know what plot to use in python to visualize this, any leads will be really helpful.
Here's an approach to get what you need, i.e. shift from one date to another:
# pivot the data so dates become columns
s = df.pivot(index='element', columns='date', values='group')
which gives s as:
date 2021-04-01 2021-04-10
element
e0 A A
e1 A C
e2 B B
e3 B B
e4 C C
e5 A C
e6 B B
e7 C A
e8 C A
e9 C A
Next,
# compare the two consecutive dates
pairwise = pd.get_dummies(s.iloc[:, 1], dtype=int).T @ pd.get_dummies(s.iloc[:, 0], dtype=int)
which gives you pairwise as:
A B C
A 1 0 3
B 0 3 0
C 2 0 1
That means, e.g. first column says that there are 3 A's on the first date, one stays A and 2 change to C on the second date. Finally, you can easily compute the percentage with
pairwise / pairwise.sum()
Output, which you can use something like sns.heatmap to visualize:
A B C
A 0.333333 0.0 0.75
B 0.000000 1.0 0.00
C 0.666667 0.0 0.25
Note as to the extended question, you would have a series of these matrices for each pair (day1, day2), (day2, day3),.... It's up to you to decide how to visualize them.
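For the multi-date case, one possible sketch is to apply the same pivot/get_dummies trick to each consecutive pair of columns. The three-date sample here is made up to mirror the question's setup, and `dtype=int` keeps the matrix product numeric on recent pandas versions:

```python
import numpy as np
import pandas as pd

# made-up sample spanning three dates, mirroring the question's setup
rng = np.random.default_rng(0)
elements = [f'e{i}' for i in range(10)]
df = pd.concat(
    [pd.DataFrame({'element': elements,
                   'group': rng.choice(['A', 'B', 'C'], size=10),
                   'date': pd.to_datetime(d)})
     for d in ('2021-04-01', '2021-04-10', '2021-04-20')],
    ignore_index=True,
)

s = df.pivot(index='element', columns='date', values='group')

# one transition matrix (as percentages) per consecutive pair of dates
matrices = {}
for i in range(s.shape[1] - 1):
    pair = (pd.get_dummies(s.iloc[:, i + 1], dtype=int).T
            @ pd.get_dummies(s.iloc[:, i], dtype=int))
    matrices[(s.columns[i], s.columns[i + 1])] = pair / pair.sum()
```

Each matrix in `matrices` could then be passed to something like sns.heatmap, e.g. as side-by-side subplots, to show how the composition drifts over time.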
I have a data frame consisting of 5.1 million rows. Now, consider only a query of my data frame
df_queried = df.query("ID1=='a' or ID2=='Y'")
which has the following form:
date    ID1  ID2
201908  a    X
201905  b    Y
201811  a    Y
201807  a    Z
You can assume that the date is sorted and that there are no duplicates in the subset ['ID1', 'ID2'].
Now, the goal is to check whether there are ID2 duplicates that contain more than one ID1 value. If that’s the case, then assign the most recent ID1 value from that list to a new column for each ID1 in that list.
For the special query of my data frame:
Y is a duplicate of ID2 containing different values for ID1, namely ['a', 'b']. Now, we have to find the most recent value from the list and assign it to the new column for all ID1 values that are in the list.
Output:
date    ID1  ID2  New_ID
201908  a    X    a
201905  b    Y    a
201811  a    Y    a
201807  a    Z    a
where New_ID equals the most recent value of ID1, following these rules:
Within each ID2 group, New_ID must take a single value: the most recent one.
Example:
This obviously holds for ID2=X and ID2=Z. For ID2=Y there are two values for ID1, {a, b}. b must be overwritten with the most recent ID1 value of this segment.
If there is more than one ID1 value within an ID2 value, then find all rows for which ID1 equals one of those values and assign the most recent one
Example: for ID2=Y, ID1 contains two values, a and b. Now, for each row with ID1==a or ID1==b, the new column New_ID must equal the most recent value of ID1, independent of ID2.
I am able to achieve this:
date    ID1  ID2  New_ID
201908  a    X    b
201905  b    Y    b
201811  a    Y    b
201807  a    Z    b
using the following loop:
df_queried['New_ID'] = df_queried['ID1']
for v2 in df_queried.ID2.unique():
    # Query data frame by ID2 value
    df_query1 = df_queried.query(f'ID2 == {v2!r}')
    # Get most recent value
    most_recent_val = df_query1.iloc[0, 1]
    # Define unique ID1 values within ID2 query
    unique_ID1_vals = df_query1.ID1.unique()
    # If several ID1 values were found, check if one val
    # also occurs in a different ID1 position
    if len(unique_ID1_vals) > 1:
        for v1 in unique_ID1_vals:
            # Get ID1 query to check existence of multiple ID2's
            df_queried.loc[df_queried['ID1'] == v1, 'New_ID'] = most_recent_val
Now, I can join the actual value a to the new column:
mapping = df_queried.drop_duplicates(subset=['New_ID'])[['ID1', 'New_ID']]
pd.merge(df_queried, mapping.rename(columns={'ID1': 'ID_temp'}), how='left')\
.drop(columns=['New_ID'])\
.rename(columns={'ID_temp': 'New_ID'})
which yields the desired result.
However, it takes way too long. I was thinking about a smarter approach, one that mainly relies on joins, but I was not able to find one.
Note: Obviously, I want to operate on the whole data frame, not only on the queried one. Therefore, the code must be stable and applicable to the whole data frame. I think my code is, but I did not try it on the full data (after 6 hours I killed the kernel). I also tried to use numba, but failed to fully implement it. I hope my problem is clear.
EDIT 1:
df_queried['New_ID'] = df_queried.groupby('ID2')['ID1'].transform('last')
This approach indeed works for this special case. However, if it is applied to a larger subset of my data, for instance:
date    ID1  ID2  New_ID  New_ID_desired
201908  a    X    a       a
201905  b    Y    a       a
201811  a    Y    a       a
201807  a    Z    a       a
202003  c    H    d       c
202001  d    H    d       c
201907  c    I    c       c
201904  d    J    d       c
the method does not hold anymore. It satisfies rule 1, but not rule 2.
However, when you use my approach, you get:
date ID1 ID2 New_ID
0 201906 a X a
1 201903 b Y a
2 201811 a Y a
3 201802 a Z a
4 202003 c H c
5 202001 d H c
6 201907 c I c
7 201904 d J c
If your data is sorted by date, then I believe what you want is simply:
df['New_ID'] = df.groupby('ID2')['ID1'].transform('last')
output:
date ID1 ID2 New_ID
0 201908 a X a
1 201905 b Y a
2 201811 a Y a
3 201807 a Z a
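A quick, self-contained check of that one-liner on the question's first sample (dates descending, as shown):

```python
import pandas as pd

df = pd.DataFrame({'date': [201908, 201905, 201811, 201807],
                   'ID1':  ['a', 'b', 'a', 'a'],
                   'ID2':  ['X', 'Y', 'Y', 'Z']})

# with the dates descending as shown, 'last' takes the bottom row of
# each ID2 group; for Y that maps b onto a, matching the desired output
df['New_ID'] = df.groupby('ID2')['ID1'].transform('last')
print(df['New_ID'].tolist())  # ['a', 'a', 'a', 'a']
```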
Okay, after googling and thinking about an approach, I finally found one using the library networkx. I wanted to share it in case someone else faces the same problem. Basically, I have a bipartite graph that I want to decompose into connected components. You can define the following functions and get the desired result as follows:
import pandas as pd
import networkx as nx
from itertools import chain

df_sub = pd.DataFrame(
    data=dict(
        date=[201906, 201903, 201811, 201802, 202003, 202001, 201907, 201904],
        ID1=["a", "b", "a", "a", "c", "d", "c", "d"],
        ID2=["X", "Y", "Y", "Z", "H", "H", "I", "J"]
    )
)

def _graph_decomposition(graph_as_df: pd.DataFrame) -> list:
    # Initialize Graph (in my case, a bipartite graph)
    G = nx.Graph()
    # Get connections
    G.add_edges_from(graph_as_df.drop_duplicates().to_numpy().tolist())
    # Create list containing connected components
    connected_components = list(nx.connected_components(G))
    return connected_components

def stabilized_ID(graph_as_df: pd.DataFrame) -> pd.DataFrame:
    components: list = _graph_decomposition(graph_as_df)
    # Chain components -> list of lists to one flat list
    ID1_mapping = list(chain.from_iterable(components))
    ID1_true = []
    for component in components:
        # Convert set to list
        component = list(component)
        # For my case, ID2 always starts with '0' and ID1 always with 'C',
        # and max(['C', '0999999']) = 'C'
        ID1_true += [max(component)] * len(component)
    # Assert lengths are equal
    assert len(ID1_true) == len(ID1_mapping)
    # Define final mapping
    mapping = pd.DataFrame(data={'ID1': ID1_mapping, 'ID1_true': ID1_true})
    return mapping

mapping = stabilized_ID(df_sub[['ID1', 'ID2']])
pd.merge(df_sub, mapping, on=['ID1'], how='inner')
This approach takes 40 seconds for my whole data frame, which consists of 5.1 million rows (the merge operation alone takes 34 seconds). It produces the following data frame:
date ID1 ID2 ID1_true
0 201906 a X b
1 201811 a Y b
2 201802 a Z b
3 201903 b Y b
4 202003 c H d
5 201907 c I d
6 202001 d H d
7 201904 d J d
Since I made the next steps time-independent, I do not need the most recent value anymore. Now, it is only important that the New_ID values equal one of the ID1 values from their connected component, not necessarily the most recent one. If needed, one could also map the most recent ID1 value as described in my question.
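If the most recent ID1 per component were needed after all, one dependency-free way to sketch the same idea is a small union-find over the bipartite ID1-ID2 edges. The `'2_'` prefix is an arbitrary guard added here so ID2 labels can never collide with ID1 labels (in the author's real data the 'C'/'0' prefixes already guarantee that):

```python
import pandas as pd

df_sub = pd.DataFrame(dict(
    date=[201906, 201903, 201811, 201802, 202003, 202001, 201907, 201904],
    ID1=["a", "b", "a", "a", "c", "d", "c", "d"],
    ID2=["X", "Y", "Y", "Z", "H", "H", "I", "J"],
))

# minimal union-find over the bipartite ID1-ID2 edges
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for id1, id2 in zip(df_sub['ID1'], df_sub['ID2']):
    union(id1, '2_' + id2)  # prefix keeps the two node sets disjoint

# label every row with its component representative, then take the
# ID1 of the newest row inside each component
df_sub['component'] = df_sub['ID1'].map(find)
newest = df_sub.sort_values('date').groupby('component')['ID1'].last()
df_sub['New_ID'] = df_sub['component'].map(newest)
print(df_sub['New_ID'].tolist())  # ['a', 'a', 'a', 'a', 'c', 'c', 'c', 'c']
```

This reproduces the New_ID_desired column from the EDIT above without the networkx dependency; whether it is faster than the graph version on 5.1 million rows would have to be measured.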
I'd need a little suggestion on a procedure using pandas. I have a 2-column dataset that looks like this:
A 0.4533
B 0.2323
A 1.2343
A 1.2353
B 4.3521
C 3.2113
C 2.1233
.. ...
where the first column contains strings and the second one floats. I would like to save the minimum value for each group of unique strings, in order to have the minimum associated with A, B and C. Does anybody have any suggestions? It would also help me to somehow store all the values associated with each string.
Many thanks,
James
Input data:
>>> df
0 1
0 A 0.4533
1 B 0.2323
2 A 1.2343
3 A 1.2353
4 B 4.3521
5 C 3.2113
6 C 2.1233
Use groupby before min:
out = df.groupby(0).min()
Output result:
>>> out
1
0
A 0.4533
B 0.2323
C 2.1233
Update:
filter out all the values in the original dataset that are more than 20% different from the minimum
out = df[df.groupby(0)[1].apply(lambda x: x <= x.min() * 1.2)]
>>> out
0 1
0 A 0.4533
1 B 0.2323
6 C 2.1233
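As a side note, the same filter can be sketched with `transform`, which returns a result aligned to the original index and sidesteps any `apply` index quirks (sample data transcribed from the question):

```python
import pandas as pd

df = pd.DataFrame({0: list('ABAABCC'),
                   1: [0.4533, 0.2323, 1.2343, 1.2353, 4.3521, 3.2113, 2.1233]})

# keep rows within 20% of their group's minimum
out = df[df[1] <= df.groupby(0)[1].transform('min') * 1.2]
print(out.index.tolist())  # [0, 1, 6]
```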
You can simply do it with:
min_A = min(df[df["column_1"] == "A"]["value"])
min_B = min(df[df["column_1"] == "B"]["value"])
min_C = min(df[df["column_1"] == "C"]["value"])
where df is the DataFrame, and column_1 and value are the names of its columns.
You can also use the built-in pandas function groupby():
>>> df.groupby(["column_1"]).min()
which gives the same result.
I have a pandas DataFrame df that portrays edges of a directed acyclic graph, sorted by Target:
Source Target
C A
D A
A B
C B
D B
E B
E C
C D
E D
I would like to add a column Weight based on occurrences of values.
Weight should illustrate the number of appearance of the Target value in Target divided by the number of appearance of the Source value in Target.
In other words, the first row of the example should have the Weight of 2/1 = 2, since A appears twice in Target where C appears only once in Target.
I have first tried
df.apply(pd.Series.value_counts)
but the problem is my actual DataFrame is extremely large, so I am not able to manually search for each occurrence value from the outcome and make a quotient. I have also tried to write two new columns that signify the values I need, then to write a final column that consists of what I want:
df['tfreq'] = df.groupby('Target')['Target'].transform('count')
df['sfreq'] = df.groupby('Source')['Target'].transform('count')
but it seems like my second line of code returns the occurrences of the Source values in the Source column instead of the Target column.
Are there any insights on this problem?
Use value_counts with map. Then divide them:
val_counts = df['Target'].value_counts()
counts1 = df['Target'].map(val_counts)
counts2 = df['Source'].map(val_counts)
df['Weights'] = counts1.div(counts2) # same as counts1 / counts2
Output
Source Target Weights
0 C A 2.0
1 D A 1.0
2 A B 2.0
3 C B 4.0
4 D B 2.0
5 E B NaN
6 E C NaN
7 C D 2.0
8 E D NaN
Note: we get NaN where the Source value (here E) never occurs in the Target column.
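Putting it together, a self-contained sketch that rebuilds the question's edge list and applies the mapping above (values transcribed from the table in the question):

```python
import pandas as pd

df = pd.DataFrame({'Source': list('CDACDEECE'),
                   'Target': list('AABBBBCDD')})

val_counts = df['Target'].value_counts()  # A: 2, B: 4, C: 1, D: 2
df['Weights'] = df['Target'].map(val_counts).div(df['Source'].map(val_counts))
print(df)
```

This prints the same Weights column as shown above, NaNs included.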
I have been trying to select a subset of a correlation matrix using the Pandas Python library.
For instance, if I had a matrix like
0 A B C
A 1 2 3
B 2 1 4
C 3 4 1
I might want to select a matrix where some of the variables in the original matrix are correlated with some of the other variables, like :
0 A C
A 1 3
C 3 1
To do this, I tried the following code: slice the original correlation matrix using the names of the desired variables in a list, transpose the correlation matrix, reassign the original column names, and then slice again.
data = pd.read_csv("correlationmatrix.csv")
initial_vertical_axis = pd.DataFrame()
for x in var_list:
    a = data[x]
    initial_vertical_axis = initial_vertical_axis.append(a)
print(initial_vertical_axis)
initial_vertical_axis = pd.DataFrame(data=initial_vertical_axis, columns=var_list)
initial_matrix = pd.DataFrame()
for x in var_list:
    a = initial_vertical_axis[x]
    initial_matrix = initial_matrix.append(a)
print(initial_matrix)
However, this returns an empty correlation matrix with the right row and column labels but no data like
0 A C
A
C
I cannot find the error in my code that would lead to this. If there is a simpler way to go about this, I am open to suggestions.
Suppose data contains your matrix,
In [122]: data
Out[122]:
A B C
0
A 1 2 3
B 2 1 4
C 3 4 1
In [123]: var_list = ['A','C']
In [124]: data.loc[var_list,var_list]
Out[124]:
A C
0
A 1 3
C 3 1
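For a runnable version of the above, assuming the CSV was loaded with its row labels as the index (e.g. `pd.read_csv(..., index_col=0)`):

```python
import pandas as pd

data = pd.DataFrame([[1, 2, 3], [2, 1, 4], [3, 4, 1]],
                    index=['A', 'B', 'C'], columns=['A', 'B', 'C'])

var_list = ['A', 'C']
sub = data.loc[var_list, var_list]  # rows and columns in one slice
print(sub.values.tolist())  # [[1, 3], [3, 1]]
```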
What is the easiest way to create a DataFrame with hierarchical columns?
I am currently creating a DataFrame from a dict of names -> Series using:
df = pd.DataFrame(data=serieses)
I would like to use the same columns names but add an additional level of hierarchy on the columns. For the time being I want the additional level to have the same value for columns, let's say "Estimates".
I am trying the following but that does not seem to work:
pd.DataFrame(data=serieses,columns=pd.MultiIndex.from_tuples([(x, "Estimates") for x in serieses.keys()]))
All I get is a DataFrame with all NaNs.
For example, what I am looking for is roughly:
l1 Estimates
l2 one two one two one two one two
r1 1 2 3 4 5 6 7 8
r2 1.1 2 3 4 5 6 71 8.2
where l1 and l2 are the labels for the MultiIndex
This appears to work:
import pandas as pd
data = {'a': [1,2,3,4], 'b': [10,20,30,40],'c': [100,200,300,400]}
df = pd.concat({"Estimates": pd.DataFrame(data)}, axis=1, names=["l1", "l2"])
l1 Estimates
l2 a b c
0 1 10 100
1 2 20 200
2 3 30 300
3 4 40 400
I know the question is really old, but as of pandas 0.19.1 one can use direct dict initialization:
d = {('a','b'): [1,2,3,4], ('a','c'): [5,6,7,8]}
df = pd.DataFrame(d, index=['r1','r2','r3','r4'])
df.columns.names = ('l1','l2')
print(df)
l1 a
l2 b c
r1 1 5
r2 2 6
r3 3 7
r4 4 8
I'm not sure, but I think a dict as input for your DataFrame and a MultiIndex don't play well together. Using an array as input instead makes it work.
I often prefer dicts as input, though; one way is to set the columns after creating the df:
import numpy as np
import pandas as pd

data = {'a': [1,2,3,4], 'b': [10,20,30,40], 'c': [100,200,300,400]}
df = pd.DataFrame(np.array(list(data.values())).T, index=['r1','r2','r3','r4'])
tups = list(zip(['Estimates']*len(data), data.keys()))
df.columns = pd.MultiIndex.from_tuples(tups, names=['l1','l2'])
l1 Estimates
l2         a   b    c
r1         1  10  100
r2         2  20  200
r3         3  30  300
r4         4  40  400
Or when using an array as input for the df:
data_arr = np.array([[1,2,3,4],[10,20,30,40],[100,200,300,400]])
tups = list(zip(['Estimates']*data_arr.shape[0], ['a','b','c']))
df = pd.DataFrame(data_arr.T, index=['r1','r2','r3','r4'],
                  columns=pd.MultiIndex.from_tuples(tups, names=['l1','l2']))
Which gives the same result.
The solution by Rutger Kassies worked in my case, but I have more than one column in the "upper level" of the column hierarchy. I just want to provide what worked for me as an example, since it is a more general case.
First, I have data that looks like this:
> df
(A, a) (A, b) (B, a) (B, b)
0 0.00 9.75 0.00 0.00
1 8.85 8.86 35.75 35.50
2 8.51 9.60 66.67 50.70
3 0.03 508.99 56.00 8.58
I would like it to look like this:
> df
A B
a b a b
0 0.00 9.75 0.00 0.00
1 8.85 8.86 35.75 35.50
...
The solution is:
tuples = df.transpose().index
new_columns = pd.MultiIndex.from_tuples(tuples, names=['Upper', 'Lower'])
df.columns = new_columns
This is counter-intuitive because, in order to create columns, I have to go through the index.
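A compact sketch of that trick on made-up numbers; the `tupleize_cols=False` index stands in for however the flat tuple columns originally arose:

```python
import pandas as pd

# flat columns whose labels are (upper, lower) tuples
cols = pd.Index([('A', 'a'), ('A', 'b'), ('B', 'a'), ('B', 'b')],
                tupleize_cols=False)
df = pd.DataFrame([[0.00, 9.75, 0.00, 0.00],
                   [8.85, 8.86, 35.75, 35.50]], columns=cols)

# the tuples come back through the transposed index, as described above
df.columns = pd.MultiIndex.from_tuples(list(df.transpose().index),
                                       names=['Upper', 'Lower'])
print(df['A'].columns.tolist())  # ['a', 'b']
```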