I have a dictionary of connected components of a graph, for example:
d = {'A':[1,5,7],'B':[2,4], 'C':[3,6]}
and I want to create a dataframe of the form:
cc node
0 A 1
1 A 5
2 A 7
3 B 2
4 B 4
5 C 3
6 C 6
I can do this by creating a df for each key in the dictionary and then concatenating them, but I am looking for a better solution.
Check with explode
pd.Series(d).explode()
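That gives a Series whose index holds the component labels; to get the exact cc/node frame, one way is to name the index and reset it (the column names here are just taken from the question):
pd.Series(d).explode().rename_axis('cc').reset_index(name='node')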
d = {'A':[1,5,7],'B':[2,4], 'C':[3,6]}
rows = []
for k, v in d.items():
    for i in v:
        rows.append([k, i])
for l in rows:
    print(l)
['A', 1]
['A', 5]
['A', 7]
['B', 2]
['B', 4]
['C', 3]
['C', 6]
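Passing the collected pairs to the DataFrame constructor then yields the desired frame:
pd.DataFrame(rows, columns=['cc', 'node'])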
Maybe something like this:
import pandas as pd
d = {'A':[1,5,7],'B':[2,4], 'C':[3,6]}
temp = [[key, n] for key, val in d.items() for n in val]
df = pd.DataFrame(temp, columns=['cc', 'node'])
print(df)
Output:
cc node
0 A 1
1 A 5
2 A 7
3 B 2
4 B 4
5 C 3
6 C 6
IIUC,
cc, node = [], []
for key, value in d.items():
    cc += [key] * len(value)
    node += value
df = pd.DataFrame({'cc' : cc, 'node' : node})
print(df)
Output
cc node
0 A 1
1 A 5
2 A 7
3 B 2
4 B 4
5 C 3
6 C 6
I have a dataset like:
Data
a
a
a
a
a
b
b
b
a
a
b
I want to add a column that looks like the one below. Each value has the form a1,1: the part before the comma labels the run (a1 is the first run of a, a2 the second), and the number after the comma is the position within that run. Is there a way we can do this using python?
Data Frequency
a a1,1
a a1,2
a a1,3
a a1,4
a a1,5
b b1,1
b b1,2
b b1,3
a a2,1
a a2,2
b b2,1
You can use:
# identify changes in Data
m = df['Data'].ne(df['Data'].shift()).cumsum()
# cumulated increments within groups
g1 = df.groupby(m).cumcount().add(1).astype(str)
# increments of different subgroups per Data
g2 = (df.loc[~m.duplicated(), 'Data']
        .groupby(df['Data']).cumcount().add(1)
        .reindex(df.index, method='ffill')
        .astype(str)
      )
df['Frequency'] = df['Data'].add(g2+','+g1)
output:
Data Frequency
0 a a1,1
1 a a1,2
2 a a1,3
3 a a1,4
4 a a1,5
5 b b1,1
6 b b1,2
7 b b1,3
8 a a2,1
9 a a2,2
10 b b2,1
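To see what the helper series hold, here is a small sketch that rebuilds the sample column and shows the first two intermediates (the lists in the comments are worked out by hand from the logic above):

df = pd.DataFrame({'Data': list('aaaaabbbaab')})
m = df['Data'].ne(df['Data'].shift()).cumsum()
# run ids:             [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
g1 = df.groupby(m).cumcount().add(1)
# position within run: [1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 1]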
Code:
from itertools import groupby
k = [key for key, _group in groupby(df['Data'].tolist())] #OUTPUT ['a', 'b', 'a', 'b']
Key = [v+f'{k[:i].count(v)+1}' for i,v in enumerate(k)] #OUTPUT ['a1', 'b1', 'a2', 'b2']
Sum = [sum(1 for _ in _group) for key, _group in groupby(df['Data'].tolist())] #OUTPUT [5, 3, 2, 1]
df['Frequency'] = [f'{K},{S}' for I, K in enumerate(Key) for S in range(1, Sum[I]+1)]
Output:
Data Frequency
0 a a1,1
1 a a1,2
2 a a1,3
3 a a1,4
4 a a1,5
5 b b1,1
6 b b1,2
7 b b1,3
8 a a2,1
9 a a2,2
10 b b2,1
def function1(dd: pd.DataFrame):
    dd2 = dd.assign(col2=dd.col1.ne(dd.col1.shift()).cumsum())\
            .assign(col2=lambda dd: dd.Data + dd.col2.astype(str))\
            .assign(rk=dd.groupby('col1').col1.transform('cumcount').astype(int) + 1)\
            .assign(col3=lambda dd: dd.col2 + ',' + dd.rk.astype(str))
    return dd2.loc[:, ['Data', 'col3']]

df1.assign(col1=df1['Data'].ne(df1['Data'].shift()).cumsum()).groupby(['Data']).apply(function1)
Data col3
0 a a1,1
1 a a1,2
2 a a1,3
3 a a1,4
4 a a1,5
5 b b1,1
6 b b1,2
7 b b1,3
8 a a2,1
9 a a2,2
10 b b2,1
Let's say I have the following Pandas dataframe. It is what it is and the input can't be changed.
df1 = pd.DataFrame(np.array([['a', 1, 'e', 5],
                             ['b', 2, 'f', 6],
                             ['c', 3, 'g', 7],
                             ['d', 4, 'h', 8]]))
df1.columns = [1,1,2,2]
See how the columns have the same name? The output I want is to have columns with the same name combined (not summed or concatenated), meaning the second column 1 is added to the end of the first column 1, like so:
df2 = pd.DataFrame(np.array([['a', 'e'],
                             ['b', 'f'],
                             ['c', 'g'],
                             ['d', 'h'],
                             [1, 5],
                             [2, 6],
                             [3, 7],
                             [4, 8]]))
df2.columns = [1,2]
How do I do this? I can do it manually, except I actually have like 10 column titles, about 100 iterations of each title, and several thousand rows, so it takes forever and I have to redo it with each new dataset.
EDIT: the columns in actual datasets are unequal in length.
Try with groupby and explode:
output = df1.groupby(level=0, axis=1).agg(lambda x: x.values.tolist()).explode(df1.columns.unique().tolist())
>>> output
1 2
0 a e
0 1 5
1 b f
1 2 6
2 c g
2 3 7
3 d h
3 4 8
Edit:
To reorder the rows, you can do:
output = output.assign(order=output.groupby(level=0).cumcount()).sort_values("order",ignore_index=True).drop("order",axis=1)
>>> output
1 2
0 a e
1 b f
2 c g
3 d h
4 1 5
5 2 6
6 3 7
7 4 8
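Note that groupby(..., axis=1) is deprecated in recent pandas releases; assuming the same df1, an equivalent sketch transposes, groups on the index, and transposes back before exploding:
output = (df1.T.groupby(level=0).agg(lambda s: s.tolist()).T
             .explode(df1.columns.unique().tolist()))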
Depending on the size of your data, you could split the data into a dictionary and then create a new data frame from that:
df1 = pd.DataFrame(np.array([['a', 1, 'e', 5],
                             ['b', 2, 'f', 6],
                             ['c', 3, 'g', 7],
                             ['d', 4, 'h', 8]]))
df1.columns = [1, 1, 2, 2]
dictionary = {}
for column in df1.columns.unique():
    items = []
    # df1[column] returns every column sharing that label; transpose so each
    # sub-column is appended whole, one after the other
    for item in df1[column].T.values.tolist():
        items += item
    dictionary[column] = items
new_df = pd.DataFrame(dictionary)
print(new_df)
You can use a dictionary whose default value is list and loop through the dataframe columns. Use the column name as dictionary key and append the column value to the dictionary value.
from collections import defaultdict
d = defaultdict(list)
for i, col in enumerate(df1.columns):
    d[col].extend(df1.iloc[:, i].values.tolist())
df = pd.DataFrame.from_dict(d, orient='index').T
print(df)
1 2
0 a e
1 b f
2 c g
3 d h
4 1 5
5 2 6
6 3 7
7 4 8
For df1.columns = [1,1,2,3], the output is
1 2 3
0 a e 5
1 b f 6
2 c g 7
3 d h 8
4 1 None None
5 2 None None
6 3 None None
7 4 None None
If I understand correctly, this seems to work:
pd.concat([s.reset_index(drop=True) for _, s in df1.melt().groupby("variable")["value"]], axis=1)
Output:
value value
0 a e
1 b f
2 c g
3 d h
4 1 5
5 2 6
6 3 7
7 4 8
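The duplicated value headers can be renamed afterwards; assuming the concat result was assigned to output, set_axis with the unique column labels restores the original names:
output = output.set_axis(df1.columns.unique(), axis=1)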
For each row, I am computing values and storing them in a dictionary. I want to be able to take the dictionary and add it to the row where the keys are columns.
For example:
Dataframe
A B C
1 2 3
Dictionary:
{
    'D': 4,
    'E': 5
}
Result:
A B C D E
1 2 3 4 5
There will be more than one row in the dataframe, and for each row I'm computing a dictionary that might not necessarily have the same exact keys.
I ended up doing this to get it to work:
def func(a):
    ...  # compute the row's dictionary here
    return pd.Series(dictionary)

applied_df = df.apply(lambda row: func(row['a']), axis='columns', result_type='expand')
df = pd.concat([df, applied_df], axis='columns')
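A self-contained sketch of that pattern on the question's sample frame (the body of func here is invented purely for illustration):

import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [2], 'C': [3]})

def func(a):
    # hypothetical per-row computation; the real keys may differ per row
    return pd.Series({'D': a + 3, 'E': a + 4})

applied_df = df.apply(lambda row: func(row['A']), axis='columns', result_type='expand')
df = pd.concat([df, applied_df], axis='columns')
print(df)
#    A  B  C  D  E
# 0  1  2  3  4  5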
If you want the dict values to appear in each row of the original dataframe, use:
d = {
    'D': 4,
    'E': 5
}
df_result = df.join(df.apply(lambda x: pd.Series(d), axis=1))
Demo
Data Input:
df
A B C
0 1 2 3
1 11 12 13
Output:
A B C D E
0 1 2 3 4 5
1 11 12 13 4 5
If you just want the dict to appear in the first row of the original dataframe, use:
d = {
    'D': 4,
    'E': 5
}
df_result = df.join(pd.Series(d).to_frame().T)
A B C D E
0 1 2 3 4.0 5.0
1 11 12 13 NaN NaN
Simply use a for loop over your dictionary and assign the values.
df = pd.DataFrame(columns=['A', 'B', 'C'], data=[[1,2,3]])
# You can test with df = pd.DataFrame(columns=['A', 'B', 'C'], data=[[1,2,3], [8,0,33]]), too.
d = {
    'D': 4,
    'E': 5
}
for k, v in d.items():
    df[k] = v
print(df)
Output:
   A  B  C  D  E
0  1  2  3  4  5
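Because each right-hand side is a scalar, the assignment broadcasts down the whole column, so the two-row test frame from the comment comes out as:
   A  B   C  D  E
0  1  2   3  4  5
1  8  0  33  4  5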
I have a df that looks like the one below, and I would like to pull rows by matching the 'D' column against my list, without the list's order being changed or its duplicates dropped.
A B C D
0 a b 1 1
1 a b 1 2
2 a b 1 3
3 a b 1 4
4 c d 2 5
5 c d 3 6 #df
My list
l = [4, 2, 6, 4] # my list
df.loc[df['D'].isin(l)].to_csv('output.csv', index = False)
When I use isin(), the result comes back in the dataframe's order with the duplicate 4 dropped, and df.loc[df['D'] == value] only prints the last line.
A B C D
3 a b 1 4
1 a b 1 2
5 c d 3 6
3 a b 1 4 # desired output
Any good way to do this? Thanks,
A solution without a loop, using merge (a left merge preserves the order and multiplicity of the keys in the left frame):
In [26]: pd.DataFrame({'D':l}).merge(df, how='left')
Out[26]:
D A B C
0 4 a b 1
1 2 a b 1
2 6 c d 3
3 4 a b 1
You're going to have to iterate over your list, get a filtered copy for each value, and then concat them all together:
l = [4, 2, 6, 4]  # you shouldn't use "list =" as list is a builtin
cache = {}
masked_dfs = []
for v in l:
    try:
        filtered_df = cache[v]
    except KeyError:
        filtered_df = df[df['D'] == v]
        cache[v] = filtered_df
    masked_dfs.append(filtered_df)
new_df = pd.concat(masked_dfs)
UPDATE: modified my answer to cache answers so that you don't have to do multiple searches for repeats
Just collect the indices of the values you are looking for, put them in a list, and then use that list to slice the data:
import pandas as pd

df = pd.DataFrame({
    'C': [6, 5, 4, 3, 2, 1],
    'D': [1, 2, 3, 4, 5, 6]
})
l = [4, 2, 6, 4]
i_locs = [ind for elem in l for ind in df[df['D'] == elem].index]
df.loc[i_locs]
results in
C D
3 3 4
1 5 2
5 1 6
3 3 4
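Another option, assuming every value in l exists in the 'D' column: index the frame by 'D' and let .loc do the ordered lookup, which keeps the list's order and duplicates (a repeated label pulls its matching row once per occurrence in l):
df.set_index('D').loc[l].reset_index()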
I have data in this format
ID Val
1 A
1 B
1 C
2 A
2 C
2 D
I want to group the data by ID, look at the pairs of values that occur together within an ID, and sum the counts of each pair across IDs. The resulting output should look like
v1 v2 count
A B 1
A C 2
A D 1
B C 1
C D 1
Is there a smart way to get this instead of looping through each possible combination?
this should work:
>>> ts = df.groupby('Val')['ID'].aggregate(lambda ts: set(ts))
>>> ts
Val
A set([1, 2])
B set([1])
C set([1, 2])
D set([2])
Name: ID, dtype: object
>>> from itertools import product
>>> pd.DataFrame([[i, j, len(ts[i] & ts[j])] for i, j in product(ts.index, ts.index) if i < j],
... columns=['v1', 'v2', 'count'])
v1 v2 count
0 A B 1
1 A C 2
2 A D 1
3 B C 1
4 B D 0
5 C D 1
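A variant sketch that skips zero-count pairs such as B/D entirely: generate the pairs per ID with itertools.combinations and count them (assuming df holds the ID/Val data above):

from itertools import combinations
import pandas as pd

# one (v1, v2) pair per combination of values within each ID
pairs = [p for _, s in df.groupby('ID')['Val'] for p in combinations(sorted(s), 2)]
out = (pd.DataFrame(pairs, columns=['v1', 'v2'])
         .value_counts().reset_index(name='count')
         .sort_values(['v1', 'v2'], ignore_index=True))
print(out)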
What I came up with:
Use pd.merge to create the cartesian product
Filter the cartesian product to include only combinations of the form that you desire
Count the number of combinations
Convert to the desired dataframe format
Unsure if it is faster than looping through all possible combinations.
import pandas as pd
# Create the dataframe
df = pd.DataFrame([
    [1, 'A'],
    [1, 'B'],
    [1, 'C'],
    [2, 'A'],
    [2, 'C'],
    [2, 'D'],
], columns=['ID', 'Val'])
'''
ID Val
0 1 A
1 1 B
2 1 C
3 2 A
4 2 C
5 2 D
[6 rows x 2 columns]
'''
# Create the cartesian product
df2 = pd.merge(df, df, on='ID')
'''
ID Val_x Val_y
0 1 A A
1 1 A B
2 1 A C
3 1 B A
4 1 B B
5 1 B C
6 1 C A
7 1 C B
8 1 C C
9 2 A A
10 2 A C
11 2 A D
12 2 C A
13 2 C C
14 2 C D
15 2 D A
16 2 D C
17 2 D D
[18 rows x 3 columns]
'''
# Count the pairs; v[0] < v[1] filters out both (A, A)-style self-pairs
# and reversed duplicates like (B, A)
counts = pd.Series([
    v for v in zip(df2.Val_x, df2.Val_y)
    if v[0] < v[1]
]).value_counts(sort=False).sort_index()
'''
(A, B) 1
(A, C) 2
(A, D) 1
(B, C) 1
(C, D) 1
dtype: int64
'''
# Combine the counts
df3 = pd.DataFrame(dict(
    v1=[v1 for v1, _ in counts.index],
    v2=[v2 for _, v2 in counts.index],
    count=counts.values
))
'''
count v1 v2
0 1 A B
1 2 A C
2 1 A D
3 1 B C
4 1 C D
'''