how to create a dict from row in sublist? - python

With this DataFrame:
Gp Wo Me CHi
1 0 1 0
2 1 0 0
3 0 1 0
4 1 0 0
5 0 2 0
6 1 0 0
I would like to create a dictionary like:
a={'Gp':['Wo', 'Me','CHi']}
In row 5 of column 'Gp', however, the value of column 'Me' is 2. How can I convert such a value like this:
a={5:[0, [1,1],0]}
That is, create a nested list whenever the value is > 1.

You can use df.iterrows(), check whether the row's value in column 'Me' equals 2, and branch with an if statement:
for index, row in df.iterrows():
    if row['Me'] == 2:
        print({row['Gp']: [row['Wo'], [1, 1], row['CHi']]})
    else:
        print({row['Gp']: [row['Wo'], row['Me'], row['CHi']]})
This will output the following dictionaries:
{1: [0, 1, 0]}
{2: [1, 0, 0]}
{3: [0, 1, 0]}
{4: [1, 0, 0]}
{5: [0, [1, 1], 0]}
{6: [1, 0, 0]}
EDIT Based on the comment:
for index, row in df.iterrows():
    if row['Me'] <= 1:
        print({row['Gp']: [row['Wo'], row['Me'], row['CHi']]})
    else:
        print({row['Gp']: [row['Wo'], [1 for _ in range(row['Me'])], row['CHi']]})
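The loops above print one dictionary per row. If a single dictionary keyed by 'Gp' is wanted, a comprehension may be more convenient. The following is a minimal sketch, assuming the sample DataFrame from the question; the helper name `expand` is hypothetical, and applying it to every column (not just 'Me') is an assumption that goes slightly beyond the original answer.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "Gp": [1, 2, 3, 4, 5, 6],
    "Wo": [0, 1, 0, 1, 0, 1],
    "Me": [1, 0, 1, 0, 2, 0],
    "CHi": [0, 0, 0, 0, 0, 0],
})


def expand(value):
    """Hypothetical helper: keep 0 and 1 as-is, turn n > 1 into a list of n ones."""
    value = int(value)
    return value if value <= 1 else [1] * value


# Build one dictionary for the whole frame instead of printing one per row.
result = {
    int(row.Gp): [expand(row.Wo), expand(row.Me), expand(row.CHi)]
    for row in df.itertuples(index=False)
}
print(result)
```

Using `itertuples` instead of `iterrows` avoids building a Series for every row, which tends to matter on larger frames.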

Related

How to apply different functions to different columns after groupby like sum and .apply(list)? (Python)

I have a dataframe where I want to group rows based on a column. Some of the columns in the rows I want to sum up and the others I want to aggregate as a list.
# creating sample data
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['id'] = [1,2,1,4]
df['group'] = [[0,1,2,3] , [0,2,3,4], [1,1,1,1], 1]
df
Out[5]:
a b c d id group
0 0.850058 0.160497 0.742296 0.354296 1 [0, 1, 2, 3]
1 0.598759 0.399200 0.799157 0.908174 2 [0, 2, 3, 4]
2 0.160764 0.671702 0.414800 0.429992 1 [1, 1, 1, 1]
3 0.011089 0.581518 0.718829 0.610140 4 1
Here I want to combine row 0 and row 2 as they have the same id. When doing this, I want to sum up the values in columns a, b, c and d but for column group, I want the lists to be appended. How can I do this?
My expected output is:
a b c d id group
0 1.155671 1.670582 0.392744 0.681494 1 [0, 1, 2, 3, 1, 1, 1, 1]
1 0.598759 0.399200 0.799157 0.908174 2 [0, 2, 3, 4]
2 0.011089 0.581518 0.718829 0.610140 4 1
(When I use only the sum or df.groupby(['id'])['group'].apply(list), the other columns are dropped. )
Use groupby.aggregate
df.groupby('id').agg({k: sum for k in ['a', 'b', 'c', 'd', 'group']})
A one-liner alternative would be using numeric_only flag. But be careful with the columns you are feeding in.
df.groupby('id').sum(numeric_only=False)
Output
a b c d group
id
1 1.488778 0.802794 0.949768 0.952676 [0, 1, 2, 3, 1, 1, 1, 1]
2 0.488390 0.512301 0.064922 0.233875 [0, 2, 3, 4]
4 0.649945 0.267125 0.229313 0.156696 1
First Solution:
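We can do this in two steps. The first step uses GroupBy.sum to get the grouped sums of the four numeric columns. The second step acts on the column group only and concatenates the lists, also with GroupBy.sum. Selecting the numeric columns explicitly keeps the two results from both containing a group column when they are joined:
df.groupby('id')[['a', 'b', 'c', 'd']].sum().join(df.groupby('id')['group'].sum()).reset_index()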
Input (Different values owing to the different random numbers generated)
a b c d id group
0 0.758148 0.781987 0.310849 0.600912 1 [0, 1, 2, 3]
1 0.694848 0.755622 0.947359 0.708422 2 [0, 2, 3, 4]
2 0.515446 0.454484 0.169883 0.697287 1 [1, 1, 1, 1]
3 0.361939 0.325718 0.143510 0.077142 4 1
Output:
id a b c d group
0 1 1.273594 1.236471 0.480732 1.298199 [0, 1, 2, 3, 1, 1, 1, 1]
1 2 0.694848 0.755622 0.947359 0.708422 [0, 2, 3, 4]
2 4 0.361939 0.325718 0.143510 0.077142 1
Second Solution
We can also use GroupBy.agg with named aggregation, as follows:
df.groupby('id', as_index=False).agg(a=('a', 'sum'), b=('b', 'sum'), c=('c', 'sum'), d=('d', 'sum'), group=('group', 'sum'))
Result:
id a b c d group
0 1 1.273594 1.236471 0.480732 1.298199 [0, 1, 2, 3, 1, 1, 1, 1]
1 2 0.694848 0.755622 0.947359 0.708422 [0, 2, 3, 4]
2 4 0.361939 0.325718 0.143510 0.077142 1
Does this work:
pd.merge(df.groupby('id', as_index = False).sum(), df.groupby('id')['group'].apply(sum).reset_index(), on = 'id')
id a b c d group
0 1 1.241602 0.839409 0.779673 0.639509 [0, 1, 2, 3, 1, 1, 1, 1]
1 2 0.967984 0.838906 0.313017 0.498611 [0, 2, 3, 4]
2 4 0.042871 0.367209 0.676656 0.178939 1
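The answers above rely on `sum` concatenating lists because `list + list` is concatenation. A variant that makes the per-column behaviour explicit is sketched below. It uses fixed numbers instead of random ones so the result is reproducible, the helper `concat_lists` is a hypothetical name, and it assumes every entry of the group column is a list (the scalar `1` from the question is written as `[1]` here).

```python
import pandas as pd

# Deterministic stand-in for the question's random sample data.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [0.5, 0.5, 0.5, 0.5],
    "id": [1, 2, 1, 4],
    "group": [[0, 1, 2, 3], [0, 2, 3, 4], [1, 1, 1, 1], [1]],
})


def concat_lists(series):
    """Hypothetical helper: flatten the lists of one group into a single list."""
    out = []
    for item in series:
        out.extend(item)
    return out


# Named aggregation: sum the numeric columns, concatenate the list column.
result = df.groupby("id", as_index=False).agg(
    a=("a", "sum"),
    b=("b", "sum"),
    group=("group", concat_lists),
)
print(result)
```

Compared with summing the lists, extending one output list avoids creating a new intermediate list for every addition.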

How do I create a co-occurrence matrix in Python?

I have a dataframe of N columns. Each element in the dataframe is in the range 0 to N-1.
For example, my dataframe can be something like (N=3):
A B C
0 0 2 0
1 1 0 1
2 2 2 0
3 2 0 0
4 0 0 0
I want to create a co-occurrence matrix (please correct me if there is a different standard name for that) of size N x N in which each element ij contains the number of times that columns i and j take the same value.
A B C
A x 2 3
B 2 x 2
C 3 2 x
Where, for example, matrix[0, 1] means that A and B assume the same value 2 times.
I don't care about the value on the diagonal.
What is the smartest way to do that?
DataFrame.corr
We can define a custom callable for calculating the "correlation" between the columns of the dataframe. This callable takes two 1D numpy arrays as its input arguments and returns the number of positions at which the elements of the two arrays are equal. Note that the diagonal is always set to 1.0 and the result is returned as floats.
df.corr(method=lambda x, y: (x==y).sum())
A B C
A 1.0 2.0 3.0
B 2.0 1.0 2.0
C 3.0 2.0 1.0
Let's try broadcasting across the transposition and summing axis 2:
import pandas as pd
df = pd.DataFrame({
'A': {0: 0, 1: 1, 2: 2, 3: 2, 4: 0},
'B': {0: 2, 1: 0, 2: 2, 3: 0, 4: 0},
'C': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0}
})
vals = df.T.values
e = (vals[:, None] == vals).sum(axis=2)
print(e)
e:
[[5 2 3]
[2 5 2]
[3 2 5]]
Turn back into a dataframe:
new_df = pd.DataFrame(e, columns=df.columns, index=df.columns)
new_df:
A B C
A 5 2 3
B 2 5 2
C 3 2 5
I don't know about the smartest way but I think this works:
import numpy as np
m = np.array([[0, 2, 0], [1, 0, 1], [2, 2, 0], [2, 0, 0], [0, 0, 0]])
n = 3
ans = np.zeros((n, n))
for i in range(n):
    for j in range(i+1, n):
        ans[i, j] = len(m) - np.count_nonzero(m[:, i] - m[:, j])
print(ans + ans.T)
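Since the question states that the diagonal does not matter, the broadcasting approach can be wrapped in a small function that zeroes it out. This is only a sketch: the function name `co_occurrence` is hypothetical, and zeroing the diagonal is one arbitrary choice among several.

```python
import numpy as np
import pandas as pd


def co_occurrence(frame):
    """Count, for each pair of columns, the rows where both hold the same value."""
    vals = frame.to_numpy().T  # shape: (n_columns, n_rows)
    # Compare every column with every other column, row by row, then count matches.
    counts = (vals[:, None, :] == vals[None, :, :]).sum(axis=2)
    np.fill_diagonal(counts, 0)  # the diagonal carries no information here
    return pd.DataFrame(counts, index=frame.columns, columns=frame.columns)


df = pd.DataFrame({
    "A": [0, 1, 2, 2, 0],
    "B": [2, 0, 2, 0, 0],
    "C": [0, 1, 0, 0, 0],
})
matrix = co_occurrence(df)
print(matrix)
```

The intermediate boolean array has shape (N, N, rows), so for very wide frames the explicit loop over column pairs may use less memory.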

How to link an array of numbers to another array of numbers?

I'm a beginner in Python and I'm currently working on a Codeforces problem called Lecture Sleep. The question gives you 3 lines of input:
6 3
1 3 5 2 5 4
1 1 0 1 0 0
I'm trying to figure out how to link the second array of numbers (1 3 5 2 5 4) to the 3rd array of numbers (1 1 0 1 0 0). So that 1 = 1, 3 = 1, 5 = 0, 2 = 1, 5 = 0, 4 = 0.
Best I can think of is zipping the two together:
lst1, lst2 = [1, 3, 5, 2, 5, 4], [1, 1, 0, 1, 0, 0]
for x, y in zip(lst1, lst2):
    print("{} = {}".format(x, y))
Which yields
1 = 1
3 = 1
5 = 0
2 = 1
5 = 0
4 = 0
This gives you a dictionary where every item of the first list is the key to the item of the second list:
lst1 = [1, 3, 5, 2, 5, 4]
lst2 = [1, 1, 0, 1, 0, 0]
res = dict(zip(lst1, lst2))
print(res)
#{1: 1, 3: 1, 5: 0, 2: 1, 4: 0}
What you need is a dictionary.
So if you have the following:
li1 = [1,3,5,2,5,4]
li2 = [1,1,0,1,0,0]
mydict = {}  # creates an empty dictionary
for i in range(len(li1)):
    mydict[li1[i]] = li2[i]
print(mydict)
This gives the following output:
{1: 1, 3: 1, 5: 0, 2: 1, 4: 0}
So these numbers are linked together now.
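One caveat with the dictionary approach: the first list contains 5 twice, and a dictionary keeps only one entry per key, so the result has five entries instead of six. In this input both 5s map to 0, so nothing is lost, but in general the later value silently overwrites the earlier one. A list of pairs keeps every position. The sketch below illustrates the difference; the variable names are hypothetical, and summing the values whose flag is 1 is just an illustrative use, not the full solution to the problem.

```python
values = [1, 3, 5, 2, 5, 4]
flags = [1, 1, 0, 1, 0, 0]

# A list of (value, flag) pairs preserves order and duplicates.
pairs = list(zip(values, flags))

# A dictionary collapses the duplicate key 5 into a single entry.
as_dict = dict(pairs)

# Example use of the pairs: add up the values whose flag is 1.
flagged_total = sum(value for value, flag in pairs if flag == 1)

print(pairs)
print(as_dict)
print(flagged_total)
```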

How to convert a dictionary into a tensor in tensorflow

This is the dictionary I have:
docs = {'computer': {'1': 1, '3': 5, '8': 2},
'politics': {'0': 2, '1': 2, '3': 1}}
I want to create a 9 * 2 tensor like this:
[
[0, 1, 0, 5, 0, 0, 0, 0, 2],
[2, 2, 0, 1, 0, 0, 0, 0, 0]
]
Because the largest key is 8, each row has 9 entries (indices 0 to 8). The number of rows and columns can grow with the dictionary.
I tried implementing this with a for loop, but the dictionary is large, so it is not efficient, and the result is a list whereas I need a tensor.
maxr = 0
for i, val in docs.items():
    for j in val.keys():
        if int(j) > int(maxr):
            maxr = int(j)
final_lst = []
for val in docs.values():
    lst = [0] * (maxr+1)
    for j, val2 in sorted(val.items()):
        lst[int(j)] = val2
    final_lst.append(lst)
print(final_lst)
If you are ok with using pandas and numpy, here's how you can do it.
import pandas as pd
import numpy as np
# Creates a dataframe with keys as index and values as cell values.
df = pd.DataFrame(docs)
# Create a new index covering every integer from the smallest to the largest dictionary key (inclusive).
# The keys are converted to int first so that the comparison is numeric rather than lexicographic.
keys = df.index.astype(int)
new_index = np.arange(keys.min(), keys.max() + 1).astype(str)
# Reindex on the full range, fill the NaN values with 0, and transpose the dataframe.
new_df = df.reindex(new_index).fillna(0).T.astype(int)
#          0  1  2  3  4  5  6  7  8
#computer  0  1  0  5  0  0  0  0  2
#politics  2  2  0  1  0  0  0  0  0
If you just want the array, you can call array = new_df.values.
#[[0 1 0 5 0 0 0 0 2]
# [2 2 0 1 0 0 0 0 0]]
If you want a tensor, you can then use tf.convert_to_tensor(new_df.values).
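If pandas is not wanted, the dense array can also be filled with NumPy index assignment. The following is a rough sketch, assuming all inner keys are strings of non-negative integers; the names `labels`, `width` and `dense` are hypothetical. The resulting array could then be passed to tf.convert_to_tensor as described above.

```python
import numpy as np

docs = {'computer': {'1': 1, '3': 5, '8': 2},
        'politics': {'0': 2, '1': 2, '3': 1}}

labels = list(docs)  # row order follows the dictionary's insertion order
# Width is the largest inner key plus one, so index 8 needs 9 columns.
width = max(int(key) for inner in docs.values() for key in inner) + 1

dense = np.zeros((len(labels), width), dtype=np.int64)
for row, label in enumerate(labels):
    cols = [int(key) for key in docs[label]]
    # Assign all values of this row in one indexed write.
    dense[row, cols] = list(docs[label].values())

print(dense)
```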

Intersection of multiple rows in single DataFrame

I have a DataFrame of temperature data with thousands of rows (time series data) and 40 columns (40 points in a catchment). Entries in this DataFrame are zeros and ones (1 means an active part of the catchment and 0 means a non-active part). I want to place the number of intersected values in a separate column (named inter) in the same DataFrame.
I expect the output to look like this (see the attached image):
The value in the first row of inter should be zero, as all entries are zero and no part is active on the first day.
The value in the 2nd row of inter should be 4, as four parts are active on day 2.
The value in the 3rd row of inter should be 3 (the number of intersected values of all rows above, including the 3rd row). Green boxes in the image show the value for the 3rd row.
The value in the 4th row of inter should be the number of intersected values of all rows above (yellow shaded area in the image).
Similarly, blue boxes show the value for the 5th row, red boxes show the value for the sixth row, and so on.
Note: for every row I will count the intersection with all rows above.
I deserve a reward for this :)
Here is your answer:
import pandas as pd
import numpy as np
# setup test data
data = {'0': [0, 0, 0, 1, 0], '1': [0, 0, 1, 0, 1], '2': [0, 0, 0, 1, 0], '3': [0, 0, 1, 1, 1], '4': [0, 1, 1, 1, 0],
        '5': [0, 0, 0, 0, 1], '6': [0, 1, 1, 1, 0], '7': [0, 0, 1, 0, 1], '8': [0, 1, 0, 1, 0], '9': [0, 1, 1, 0, 0],
        '10': [0, 0, 1, 0, 0], '11': [0, 0, 0, 1, 1], '12': [0, 0, 0, 1, 1]}
data = pd.DataFrame(data=data)
# collect inter data
inter_data = []
for main_index, main_row in data.iterrows():
    # select data for calculations
    selected_data = data.loc[0:main_index, :]
    # handle rows with only 0 values (such as the first row)
    if 1 not in main_row.values:
        inter_data.append(0)
    else:
        # handle second row
        if selected_data.shape[0] == 2:
            inter_data.append(selected_data[1:2].values[0].sum())
        # handle rest of data
        else:
            # drop last row from selected data
            selected_data = selected_data[:-1]
            # sum selected data
            summed_data = 0
            for index, row in selected_data.iterrows():
                summed_data += row.values
            # get position of 1
            positions = np.where(main_row.values == 1)
            # get summed data based on position
            positions_data = summed_data[positions[0]]
            # count positions that were active in at least one earlier row
            inter_data.append((positions_data >= 1).sum())
# add inter data to raw data
data['inter'] = pd.DataFrame(inter_data)
Output:
0 1 2 3 4 5 6 7 8 9 10 11 12 inter
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 1 0 1 0 1 1 0 0 0 4
2 0 1 0 1 1 0 1 1 0 1 1 0 0 3
3 1 0 1 1 1 0 1 0 1 0 0 1 1 4
4 0 1 0 1 0 1 0 1 0 0 0 1 1 5
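With thousands of rows, the nested iterrows loops above may become slow. The same numbers can probably be obtained without explicit loops by tracking which columns have been active in any earlier row. The sketch below uses the answer's test data and keeps its special case for the second row (which counts the row's own active entries); the names `seen_before` and `inter` are hypothetical, and it assumes the frame has at least two rows and contains only 0s and 1s.

```python
import pandas as pd

data = pd.DataFrame({
    '0': [0, 0, 0, 1, 0], '1': [0, 0, 1, 0, 1], '2': [0, 0, 0, 1, 0],
    '3': [0, 0, 1, 1, 1], '4': [0, 1, 1, 1, 0], '5': [0, 0, 0, 0, 1],
    '6': [0, 1, 1, 1, 0], '7': [0, 0, 1, 0, 1], '8': [0, 1, 0, 1, 0],
    '9': [0, 1, 1, 0, 0], '10': [0, 0, 1, 0, 0], '11': [0, 0, 0, 1, 1],
    '12': [0, 0, 0, 1, 1],
})

# 1 where the column was active in at least one strictly earlier row.
seen_before = data.cummax().shift(1, fill_value=0)

# Count columns that are active now and were active at some earlier point.
inter = ((data == 1) & (seen_before == 1)).sum(axis=1)

# Same special case as the loop version: the second row counts its own 1s.
inter.iloc[1] = data.iloc[1].sum()

data['inter'] = inter
print(data)
```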
