I am a data scientist working with a text file that records how many datasets exist for each participant: the participant's ID is printed on a new line for each dataset, and the second column holds that participant's number, like so:
a 1
a 1
a 1
b 2
b 2
c 3
d 4
d 4
d 4
I now need to create a matrix that has a column for each participant and marks which lines refer to that participant with a 1 (and a 0 otherwise). I have over 2000 participants, so I cannot do this by hand or write out every column number and what to print where; I have to create a rule.
The number of columns in my file will be the number in the last row of column 2 plus 2 (in the example: 4 + 2 = 6). Basically, for each row, I need to print a 1 in the column that matches the value in column 2 (the participant's number) plus 2; all other columns in that row get a 0. So for row 1, column 1 + 2 = 3 gets a 1 and all other columns get a 0; for row 2, column 3 gets a 1 again, and so on.
The result should look like this:
a 1 1 0 0 0
a 1 1 0 0 0
a 1 1 0 0 0
b 2 0 1 0 0
b 2 0 1 0 0
c 3 0 0 1 0
d 4 0 0 0 1
d 4 0 0 0 1
d 4 0 0 0 1
I wish I could provide code that I have tried, but I don't know where to start.
I hope someone can help. Thanks!
awk to the rescue!
$ awk 'NR==FNR{if(max<$2)max=$2; next}             # first pass: find the largest participant number
       {printf "%s %s", $1,$2;                     # second pass: reprint the two original columns
        for(i=1;i<=max;i++) printf " %s", i==$2;   # 1 where the column index matches the participant number, else 0
        print ""}' file{,}
a 1 1 0 0 0
a 1 1 0 0 0
a 1 1 0 0 0
b 2 0 1 0 0
b 2 0 1 0 0
c 3 0 0 1 0
d 4 0 0 0 1
d 4 0 0 0 1
d 4 0 0 0 1
With this double-scan approach (the shell brace expansion file{,} expands to file file, so awk reads the file once to find the maximum participant number and a second time to print), the consistency and order of the participant numbers don't matter.
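If you would rather do this in Python, here is a minimal pandas sketch (the file name data.txt and the column names are placeholders, and it assumes a whitespace-separated file with no header). Note that get_dummies only creates columns for participant numbers that actually occur, whereas the awk version always spans 1 through the maximum:

import pandas as pd

# read the two-column file; the names are assumptions, not given in the original data
df = pd.read_csv('data.txt', sep=r'\s+', header=None, names=['id', 'num'])

# one indicator column per participant number, 1 where the row matches
dummies = pd.get_dummies(df['num']).astype(int)

out = pd.concat([df, dummies], axis=1)
print(out.to_string(index=False, header=False))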
Related
I have a problem in which I want to take Table 1 and turn it into Table 2 using Python.
Does anybody have any ideas? I've tried to split the Value column from Table 1, but I run into issues because each value is a different length, so I can't always define how much to split.
Equally, I have not been able to think through how to create a new column that records the position each character held in the original string.
Table 1, before:
ID  Value
1   000000S
2   000FY
Table 2, after:
ID  Position  Value
1   1         0
1   2         0
1   3         0
1   4         0
1   5         0
1   6         0
1   7         S
2   1         0
2   2         0
2   3         0
2   4         F
2   5         Y
You can split the string into individual characters and explode:
out = (df
   .assign(Value=df['Value'].apply(list))  # split each string into a list of characters
   .explode('Value')                       # one row per character
)
output:
ID Value
0 1 0
0 1 0
0 1 0
0 1 0
0 1 0
0 1 0
0 1 S
1 2 0
1 2 0
1 2 0
1 2 F
1 2 Y
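To also get the Position column the question asks for, a per-ID running counter can be appended to this result (the same cumcount idea as in the next answer):

out['Position'] = out.groupby('ID').cumcount() + 1   # 1-based position within each ID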
Given:
ID Value
0 1 000000S
1 2 000FY
Doing:
df.Value = df.Value.apply(list)                    # split each string into characters
df = df.explode('Value')                           # one row per character
df['Position'] = df.groupby('ID').cumcount() + 1   # 1-based position within each ID
Output:
ID Value Position
0 1 0 1
0 1 0 2
0 1 0 3
0 1 0 4
0 1 0 5
0 1 0 6
0 1 S 7
1 2 0 1
1 2 0 2
1 2 0 3
1 2 F 4
1 2 Y 5
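If the repeated index values left over from explode are unwanted, resetting gives a fresh RangeIndex:

df = df.reset_index(drop=True)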
How do I do this operation using pandas?
Initial Df:
A B C D
0 0 1 0 0
1 0 1 0 0
2 0 0 1 1
3 0 1 0 1
4 1 1 0 0
5 1 1 1 0
Final Df:
A B C D Param
0 0 1 0 0 1
1 0 1 0 0 0
2 0 0 1 1 2
3 0 1 0 1 0
4 1 1 0 0 1
5 1 1 1 0 0
Basically, Param counts the 1s in that row that appear for the first time in their own column.
Example:
index 0: the 1 in column B appears for the first time in its column, hence Param = 1
index 1: no 1 appears for the first time in its own column, hence Param = 0
index 2: the 1s in columns C and D appear for the first time in their columns, hence Param = 2
index 3: no 1 appears for the first time in its own column, hence Param = 0
index 4: the 1 in column A appears for the first time in its column, hence Param = 1
index 5: no 1 appears for the first time in its own column, hence Param = 0
You can do this with idxmax and value_counts: idxmax gives, for each column, the index of its first 1, and value_counts tallies how many columns have their first 1 at each index:
df['Param']=df.idxmax().value_counts().reindex(df.index,fill_value=0)
df
A B C D Param
0 0 1 0 0 1
1 0 1 0 0 0
2 0 0 1 1 2
3 0 1 0 1 0
4 1 1 0 0 1
5 1 1 1 0 0
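To see why this works, here is the breakdown on the sample frame (computed before Param is assigned; note this assumes every column contains at least one 1, otherwise idxmax would report row 0 for an all-zero column):

print (df.idxmax())
A    4
B    0
C    2
D    2

print (df.idxmax().value_counts())
2    2
0    1
4    1

The reindex against df.index then fills the missing rows (1, 3, 5) with 0.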
You can mark first occurrences per column with ~duplicated(), multiply with df so that only the first-time 1s survive, and sum across each row:
df['Param'] = df.apply(lambda x: ~x.duplicated()).mul(df).sum(1)
Output:
A B C D Param
0 0 1 0 0 1
1 0 1 0 0 0
2 0 0 1 1 2
3 0 1 0 1 0
4 1 1 0 0 1
5 1 1 1 0 0
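For intuition, here is the first-occurrence mask on the sample frame (before Param is added); multiplying it with df keeps only the 1s that are first in their column, and summing each row gives Param:

print (df.apply(lambda x: ~x.duplicated()))
       A      B      C      D
0   True   True   True   True
1  False  False  False  False
2  False   True   True   True
3  False  False  False  False
4   True  False  False  False
5  False  False  False  False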
Assuming these are integers, you can use cumsum() twice to isolate the first occurrence of 1.
df2 = (df.cumsum() > 0).cumsum() == 1   # True only at each column's first 1
df['Param'] = df2.sum(axis = 1)
print(df)
If df elements are strings, you should first convert them to integers.
df = df.astype(int)
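For intuition, here are the intermediate running counts on the sample frame; the positions equal to 1 are exactly each column's first 1:

print ((df.cumsum() > 0).cumsum())
   A  B  C  D
0  0  1  0  0
1  0  2  0  0
2  0  3  1  1
3  0  4  2  2
4  1  5  3  3
5  2  6  4  4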
In pandas I have the following data frame:
a b
0 0
1 1
2 1
0 0
1 0
2 1
Now I want to do the following:
Create a new column c and, for each row where a = 0, fill c with 1. Then c should stay 1 up to and including the first following row where b = 1, and be 0 from there until the next row with a = 0 (this is where I'm stuck). The output should look like this:
a b c
0 0 1
1 1 1
2 1 0
0 0 1
1 0 1
2 1 1
Thanks!
It seems you need:
df['c'] = df.groupby(df.a.eq(0).cumsum())['b'].cumsum().le(1).astype(int)
print (df)
a b c
0 0 0 1
1 1 1 1
2 2 1 0
3 0 0 1
4 1 0 1
5 2 1 1
Detail:
print (df.a.eq(0).cumsum())
0 1
1 1
2 1
3 2
4 2
5 2
Name: a, dtype: int32
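And the second step, the running sum of b within each of those groups; rows where it is still at most 1 get c = 1:

print (df.groupby(df.a.eq(0).cumsum())['b'].cumsum())
0    0
1    1
2    2
3    0
4    0
5    1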
I am a newbie in python. I have a data frame that looks like this:
A B C D E
0 1 0 1 0 1
1 0 1 0 0 1
2 0 1 1 1 0
3 1 0 0 1 0
4 1 0 0 1 1
How can I write a loop to gather, for each row, the names of the columns holding a 1? I expect my result set to look like this:
A B C D E Result
0 1 0 1 0 1 ACE
1 0 1 0 0 1 BE
2 0 1 1 1 0 BCD
3 1 0 0 1 0 AD
4 1 0 0 1 1 ADE
Can anyone help me with that? Thank you!
The dot method is made for exactly this purpose, since you want the matrix product of your matrix with the vector of column names:
df.dot(df.columns)
Out[5]:
0 ACE
1 BE
2 BCD
3 AD
4 ADE
If your dataframe is numeric, obtain the boolean matrix first by testing your df against 0:
(df!=0).dot(df.columns)
PS: Just assign the result to the new column
df['Result'] = df.dot(df.columns)
df
Out[7]:
A B C D E Result
0 1 0 1 0 1 ACE
1 0 1 0 0 1 BE
2 0 1 1 1 0 BCD
3 1 0 0 1 0 AD
4 1 0 0 1 1 ADE
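If you want a separator between the names, a common variant (assuming every row contains at least one 1) is to append the separator to each column name before the dot product and strip the trailing one:

df['Result'] = (df != 0).dot(df.columns + ',').str.rstrip(',')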
I have a dataframe df with columns [ShowOnAir, AfterPremier, ID, EverOnAir].
My condition applies to the first element of each groupby(df.ID) group:
if df.ShowOnAir == 0 or df.AfterPremier == 0, then EverOnAir = 0
else EverOnAir = 1
I am not sure how to compare the first element of each group with the elements of the original dataframe df.
I would really appreciate some help with this. Thank you!
You can get a row number within your groups by using cumsum on a column of ones; then you can apply your logic to the resulting dataframe:
import pandas as pd

df = pd.DataFrame([[1],[1],[2],[2],[2]])
df['n'] = 1                 # helper column of ones
df.groupby(0).cumsum()      # running count within each group; 1 marks the first row
n
0 1
1 2
2 1
3 2
4 3
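Building on that row number, a minimal sketch of the full logic (the sample values below are made up for illustration): flag each group's first row, then fill EverOnAir from the two columns:

import pandas as pd

# hypothetical sample data
df = pd.DataFrame({'ShowOnAir':    [0, 0, 1, 1, 1, 0],
                   'AfterPremier': [0, 1, 1, 1, 0, 0],
                   'ID':           ['a', 'a', 'a', 'b', 'b', 'b']})

df['n'] = 1
first = df.groupby('ID')['n'].cumsum() == 1   # True on each group's first row
df['EverOnAir'] = 1
df.loc[first, 'EverOnAir'] = ((df['ShowOnAir'] != 0) &
                              (df['AfterPremier'] != 0)).astype(int)
df = df.drop(columns='n')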
You can first create a new column EverOnAir filled with 1. Then group by ID and apply a custom function f that inspects the first element of each column via iat and fills in 0 or 1:
print df
ShowOnAir AfterPremier ID
0 0 0 a
1 0 1 a
2 1 1 a
3 1 1 b
4 1 0 b
5 0 0 b
6 0 1 c
7 1 0 c
8 0 0 c
import numpy as np

def f(x):
    # decide from the group's first row: 0 if either flag is 0, else 1
    x['EverOnAir'].iat[0] = np.where((x['ShowOnAir'].iat[0] == 0) |
                                     (x['AfterPremier'].iat[0] == 0), 0, 1)
    return x
df['EverOnAir'] = 1
print df.groupby('ID').apply(f)
ShowOnAir AfterPremier ID EverOnAir
0 0 0 a 0
1 0 1 a 1
2 1 1 a 1
3 1 1 b 1
4 1 0 b 1
5 0 0 b 1
6 0 1 c 0
7 1 0 c 1
8 0 0 c 1