Concatenate Data - Python

I am working with data in a .txt file, in the format below:
family1 1 0 0 2 0 2 2 0 0 0 1 0 1 1 0 0 0 0 1 NA NA 4
family1 2 0 0 2 2 1 4 0 0 0 0 0 0 0 0 0 0 0 0 NA NA 4
family1 3 0 0 2 5 1 2 0 0 0 1 1 0 1 1 1 0 0 0 NA NA 2
family2 1 0 0 2 5 2 1 1 1 1 0 0 0 0 0 0 0 0 0 NA NA 3
etc.
where the second column is a member of the family and the other columns are numbers that correspond to traits.
I need to compare the relatives listed in this data set to create an output like this:
family1 1 2 traitnumber traitnumber ...
family1 1 3 traitnumber traitnumber ...
family1 2 3 traitnumber traitnumber ...
where the numbers are the relatives.
I have created a data frame using:
import pandas as pd
data = pd.read_csv('file.txt', sep=" ", header=None)
print(data)
Can you offer any advice on the most efficient way to concatenate this data into the desired rows? I am having trouble thinking of a way to write code for the different combinations, i.e. relatives 1 and 2, 1 and 3, and 2 and 3.
Thank you!

You might find combinations from itertools to be helpful.
from itertools import combinations
print([thing for thing in combinations((1,2,3), 2)])
Yields
[(1, 2), (1, 3), (2, 3)]

Building on DragonBobZ's comment, you could do something like this, using the dataframe's groupby function to split out the families:
import pandas as pd
from itertools import combinations

data = pd.read_csv('file.txt', sep=" ", header=None)
print(data)

grouped_df = data.groupby(0)
for key, item in grouped_df:
    print(key)
    current_subgroup = grouped_df.get_group(key)
    print(current_subgroup)
    print(current_subgroup.shape, "\n")
    print([thing for thing in combinations(range(current_subgroup.shape[0]), 2)])
Grabbing the output of the combinations line gives you a list of tuples that you can use together with row indexing to perform the comparisons on the appropriate columns, as in the sketch below.
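A minimal end-to-end sketch might look like the following. The comparison rule (collecting the trait columns on which the two relatives agree) is an assumption, since the question doesn't specify how the output trait numbers are derived:
import pandas as pd
from itertools import combinations

data = pd.read_csv('file.txt', sep=" ", header=None)

rows = []
for family, group in data.groupby(0):
    # compare every pair of relatives within this family
    for i, j in combinations(range(len(group)), 2):
        rel_a, rel_b = group.iloc[i], group.iloc[j]
        # columns 2 onward hold the traits; collect those on which
        # the two relatives agree (an assumed comparison rule)
        shared = [col for col in group.columns[2:] if rel_a[col] == rel_b[col]]
        rows.append([family, rel_a[1], rel_b[1]] + shared)

result = pd.DataFrame(rows)
print(result)
Since the rows can have different lengths, pd.DataFrame pads the shorter ones with NaN.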


Get maximum occurrence of one specific value per row with pandas

I have the following dataframe:
1 2 3 4 5 6 7 8 9
0 0 0 1 0 0 0 0 0 1
1 0 0 0 0 1 1 0 1 0
2 1 1 0 1 1 0 0 1 1
...
For each row, I want to get the length of the longest sequence of the value 0.
So the expected result for this dataframe is an array that looks like this:
[5,4,2,...]
since on the first row the maximum sequence of the value 0 is 5, etc.
I have seen this post and tried, as a start, to get this for the first row (though I would like to do this at once for the whole dataframe), but I got errors:
s=df_day.iloc[0]
(~s).cumsum()[s].value_counts().max()
TypeError: ufunc 'invert' not supported for the input types, and the
inputs could not be safely coerced to any supported types according to
the casting rule ''safe''
When I inserted the values manually, like this:
s=pd.Series([0,0,1,0,0,0,0,0,1])
(~s).cumsum()[s].value_counts().max()
>>>7
I got 7, which is the total number of 0s in the row, but not the longest sequence.
However, I don't understand why it raises the error in the first place, and, more importantly, I would like to run it in the end on the whole dataframe, per row.
My end goal: get the maximum uninterrupted occurrence of the value 0 in a row.
Vectorized solution that counts consecutive 0s per row; for the maximum, take the max of DataFrame c:
#more explain https://stackoverflow.com/a/52718619/2901002
m = df.eq(0)
b = m.cumsum(axis=1)
c = b.sub(b.mask(m).ffill(axis=1).fillna(0)).astype(int)
print (c)
1 2 3 4 5 6 7 8 9
0 1 2 0 1 2 3 4 5 0
1 1 2 3 4 0 0 1 0 1
2 0 0 1 0 0 1 2 0 0
df['max_consecutive_0'] = c.max(axis=1)
print (df)
1 2 3 4 5 6 7 8 9 max_consecutive_0
0 0 0 1 0 0 0 0 0 1 5
1 0 0 0 0 1 1 0 1 0 4
2 1 1 0 1 1 0 0 1 1 2
Use:
df = df.T.apply(lambda x: (x != x.shift()).astype(int).cumsum().where(x.eq(0)).dropna().value_counts().max())
Output:
0 5
1 4
2 2
The following code should do the job. The function longest_streak counts the lengths of the runs of consecutive zeros and returns the longest one; you can then apply it to your df.
from itertools import groupby

def longest_streak(l):
    lst = []
    for n, c in groupby(l):
        num, count = n, sum(1 for i in c)
        if num == 0:
            lst.append((num, count))
    # default=0 covers rows that contain no zeros at all
    return max((y for x, y in lst), default=0)

df.apply(longest_streak, axis=1)
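For completeness, here is a NumPy sketch of the same run-length idea (an addition, not taken from the answers above; it assumes df is numeric as in the sample). Padding each row with False means every run of zeros gets a detectable start and end:
import numpy as np

a = (df.to_numpy() == 0)
# pad with False on both sides so np.diff marks every run boundary
padded = np.pad(a, ((0, 0), (1, 1)), mode='constant')
d = np.diff(padded.astype(np.int8), axis=1)
run_rows, starts = np.where(d == 1)   # where a zero-run begins
_, ends = np.where(d == -1)           # where it ends
lengths = ends - starts               # length of each run
result = np.zeros(a.shape[0], dtype=int)
np.maximum.at(result, run_rows, lengths)  # per-row maximum run length
print(result)  # [5 4 2] for the sample frame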

pandas: convert a list in a column into separate columns

dataframe df has a column
id data_words
1 [salt,major,lab,water]
2 [lab,plays,critical,salt]
3 [water,success,major]
I want to one-hot encode the column:
id critical lab major plays salt success water
1 0 1 1 0 1 0 1
2 1 1 0 1 1 0 0
3 0 0 1 0 0 1 1
What I tried:
Attempt 1:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('data_words')),
columns=mlb.classes_,
index=df.index))
Error: ValueError: columns overlap but no suffix specified: Index(['class'], dtype='object')
Attempt 2:
I converted the list into a simple comma-separated string with the following code:
df['data_words_Joined'] = df.data_words.apply(','.join)
which makes the dataframe look like this:
id data_words
1 salt,major,lab,water
2 lab,plays,critical,salt
3 water,success,major
Then I tried
pd.concat([df,pd.get_dummies(df['data_words_Joined'])],axis=1)
But it makes each whole string into a single column name, instead of treating the separate words as separate columns:
id salt,major,lab,water lab,plays,critical,salt water,success,major
1 1 0 0
2 0 1 0
3 0 0 1
You can try explode followed by pivot_table:
df_e = df.explode('data_words')
print(df_e.pivot_table(index=df_e['id'],columns=df_e['data_words'],values='id',aggfunc='count',fill_value=0))
Returning the following output:
data_words critical lab major plays salt success water
id
1 0 1 1 0 1 0 1
2 1 1 0 1 1 0 0
3 0 0 1 0 0 1 1
Edit: Adding data for replication purposes:
df = pd.DataFrame({'id':[1,2,3],
'data_words':[['salt','major','lab','water'],['lab','plays','critical','salt'],['water','success','major']]})
Which looks like:
id data_words
0 1 [salt, major, lab, water]
1 2 [lab, plays, critical, salt]
2 3 [water, success, major]
One possible approach could be to use get_dummies with your apply function:
new_df = df.data_words.apply(','.join).str.get_dummies(sep=',')
print(new_df)
Output:
critical lab major plays salt success water
0 0 1 1 0 1 0 1
1 1 1 0 1 1 0 0
2 0 0 1 0 0 1 1
Tested with pandas version 1.1.2, with input data borrowed from Celius Stingher's answer.

How to sum columns with the same name [duplicate]

I have a dataframe with about 100 columns that looks like this:
Id Economics-1 English-107 English-2 History-3 Economics-zz Economics-2 \
0 56 1 1 0 1 0 0
1 11 0 0 0 0 1 0
2 6 0 0 1 0 0 1
3 43 0 0 0 1 0 1
4 14 0 1 0 0 1 0
Histo Economics-51 Literature-re Literatureu4
0 1 0 1 0
1 0 0 0 1
2 0 0 0 0
3 0 1 1 0
4 1 0 0 0
My goal is to keep only the global categories -- Economics, English, History, Literature -- writing into each the sum of the values of its components. For instance, "English" would be the sum of "English-107" and "English-2":
Id Economics English History Literature
0 56 1 1 2 1
1 11 1 0 0 1
2 6 0 1 1 0
3 43 2 0 1 1
4 14 0 1 1 0
For this purpose, I have tried two methods. First method:
df = pd.read_csv(file_path, sep='\t')
df['History'] = df.loc[df[df.columns[pd.Series(df.columns).str.startswith('History')]].sum(axes=1)]
Second method:
df = pd.read_csv(file_path, sep='\t')
filter_col = [col for col in list(df) if col.startswith('History')]
df['History'] = 0 # initialize value, otherwise throws KeyError
for c in df[filter_col]:
df['History'] = df[filter_col].sum(axes=1)
print df['History', df[filter_col]]
However, both gives the error:
TypeError: 'DataFrame' objects are mutable, thus they cannot be
hashed
My question is: how can I debug this error, or is there another solution to my problem? Note that I have a rather large dataframe (about 100 columns and 400,000 rows), so I'm looking for an optimized solution, like using loc in pandas.
I'd suggest that you do something different: perform a transpose, group by the prefix of the rows (your original columns), sum, and transpose again.
Consider the following:
df = pd.DataFrame({
    'a_a': [1, 2, 3, 4],
    'a_b': [2, 3, 4, 5],
    'b_a': [1, 2, 3, 4],
    'b_b': [2, 3, 4, 5],
})
Now
[s.split('_')[0] for s in df.T.index.values]
is the prefix of the columns. So
>>> df.T.groupby([s.split('_')[0] for s in df.T.index.values]).sum().T
a b
0 3 3
1 5 5
2 7 7
3 9 9
does what you want.
In your case, make sure to split using the '-' character, as in the sketch below.
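Applied to the question's frame, that might look like this (a sketch; it assumes Id is a regular column and sets it aside first so it isn't summed; note that columns like Histo or Literatureu4 contain no '-' and would therefore remain categories of their own):
body = df.set_index('Id')
prefixes = [c.split('-')[0] for c in body.columns]
# group the transposed frame's rows (the original columns) by prefix, then flip back
summed = body.T.groupby(prefixes).sum().T
print(summed)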
You can use the following to create the sum of all columns starting with a specific name:
df['Economics'] = df[list(df.filter(regex='Economics'))].sum(axis=1)
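To cover all the categories at once, the same idea can go in a loop. A sketch (the category list is taken from the question, and anchoring the regex with ^ avoids accidental substring matches):
for cat in ['Economics', 'English', 'History', 'Literature']:
    # sum every column whose name starts with the category prefix;
    # the bare 'Histo' column would need its own pattern ('^Histo'),
    # as the mapping-based answer below discusses
    df[cat] = df.filter(regex='^' + cat).sum(axis=1)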
Using DSM's brilliant idea:
from __future__ import print_function
import pandas as pd

categories = set(['Economics', 'English', 'Histo', 'Literature'])

def correct_categories(cols):
    return [cat for col in cols for cat in categories if col.startswith(cat)]

df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df)
print(df.groupby(correct_categories(df.columns), axis=1).sum())
Output:
Economics English Histo Literature
Id
56 1 1 2 1
11 1 0 0 1
6 1 1 0 0
43 2 0 1 1
14 1 1 1 0
Here is another version, which takes care of the "Histo/History" problem:
from __future__ import print_function
import pandas as pd

#categories = set(['Economics', 'English', 'Histo', 'Literature'])
#
# mapping: common starting pattern -> desired name
#
categories = {
    'Histo': 'History',
    'Economics': 'Economics',
    'English': 'English',
    'Literature': 'Literature'
}

def correct_categories(cols):
    return [categories[cat] for col in cols for cat in categories.keys() if col.startswith(cat)]

df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df.columns, len(df.columns))
#print(correct_categories(df.columns), len(correct_categories(df.columns)))
#print(df.groupby(pd.Index(correct_categories(df.columns)), axis=1).sum())
rslt = df.groupby(correct_categories(df.columns), axis=1).sum()
print(rslt)
print('History\n', rslt['History'])
Output:
Economics English History Literature
Id
56 1 1 2 1
11 1 0 0 1
6 1 1 0 0
43 2 0 1 1
14 1 1 1 0
History
Id
56 2
11 0
6 0
43 1
14 1
Name: History, dtype: int64
P.S. You may want to add missing categories to the categories map/dictionary; one tolerant variant is sketched below.
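For instance, a tolerant variant of correct_categories (a hypothetical extension, not part of the original answer) could fall back to the column's own name, so every column keeps a groupby key instead of silently dropping out and misaligning the keys:
def correct_categories(cols):
    out = []
    for col in cols:
        # map the column to its category, or keep its own name if nothing matches
        matches = [categories[cat] for cat in categories if col.startswith(cat)]
        out.append(matches[0] if matches else col)
    return out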

Append count of rows meeting a condition within a group to Pandas dataframe

I know how to append a column counting the number of elements in a group, but I need to count only the elements within each group that meet a certain condition.
For example, if I have the following data:
import numpy as np
import pandas as pd
columns=['group1', 'value1']
data = np.array([np.arange(5)]*2).T
mydf = pd.DataFrame(data, columns=columns)
mydf.group1 = [0,0,1,1,2]
mydf.value1 = ['P','F',100,10,0]
valueslist={'50','51','52','53','54','55','56','57','58','59','60','61','62','63','64','65','66','67','68','69','70','71','72','73','74','75','76','77','78','79','80','81','82','83','84','85','86','87','88','89','90','91','92','93','94','95','96','97','98','99','100','A','B','C','D','P','S'}
and my dataframe therefore looks like this:
mydf
group1 value1
0 0 P
1 0 F
2 1 100
3 1 10
4 2 0
I would then want to count the number of rows within each group1 value where value1 is in valueslist.
My desired output is:
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
After changing the type of the value1 column to match your valueslist (or the other way around), you can use isin to get a True/False column, and convert that to 1s and 0s with astype(int). Then we can apply an ordinary groupby transform:
In [13]: mydf["value1"] = mydf["value1"].astype(str)
In [14]: mydf["count"] = (mydf["value1"].isin(valueslist).astype(int)
.groupby(mydf["group1"]).transform(sum))
In [15]: mydf
Out[15]:
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
mydf.value1=mydf.value1.astype(str)
mydf['count']=mydf.group1.map(mydf.groupby('group1').apply(lambda x : sum(x.value1.isin(valueslist))))
mydf
Out[412]:
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
Data input:
valueslist=['50','51','52','53','54','55','56','57','58','59','60','61','62','63','64','65','66','67','68','69','70','71','72','73','74','75','76','77','78','79','80','81','82','83','84','85','86','87','88','89','90','91','92','93','94','95','96','97','98','99','100','A','B','C','D','P','S']
You can group by group1 and then use transform to count, for each group, how many of the values are in the list:
mydf['count'] = mydf.groupby('group1').transform(lambda x: x.astype(str).isin(valueslist).sum())
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
Here is one more way to do it, albeit as a one-liner (note that it counts distinct matching values per group, which coincides with the other answers' results for this data):
mydf.merge(mydf.groupby('group1').apply(lambda x: len(set(x['value1'].values).intersection(valueslist))).reset_index().rename(columns={0: 'count'}), how='inner', on='group1')
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0

Generating new columns as a full-combination of other columns

I could not find any similar cases here.
Suppose I have a DataFrame:
df = pd.DataFrame({'A': [2, 2, 1, 2],
                   'B': [2, 2, 3, 3],
                   'C': [3, 3, 3, 4],
                   'I': [1, 0, 0, 1],
                   'II': [0, 1, 0, 1]})
So it is:
A B C I II
0 2 2 3 1 0
1 2 2 3 0 1
2 1 3 3 0 0
3 2 3 4 1 1
I want to make a full pairwise combination between {A,B,C} and {I,II}, so I get {I-A, I-B, I-C, II-A, II-B, II-C}.
Each new column is just the elementwise multiplication of the corresponding base columns:
I-A I-B I-C II-A II-B II-C
0 2 2 3 0 0 0
1 0 0 0 2 2 3
2 0 0 0 0 0 0
3 2 3 4 2 3 4
At the moment I don't have any working solution. I'm trying to use loops (without success so far), but I hope there's a more efficient way.
It's pretty simple, really. You have two sets of columns that you want to combine pairwise. I won't even bother with permutation tools:
>>> new_df = pd.DataFrame()
>>>
>>> for i in ["I", "II"]:
...     for a in ["A", "B", "C"]:
...         new_df[i+"-"+a] = df[i] * df[a]
...
>>> new_df
I-A I-B I-C II-A II-B II-C
0 2 2 3 0 0 0
1 0 0 0 2 2 3
2 0 0 0 0 0 0
3 2 3 4 2 3 4
Of course you could obtain the lists of column names as slices of df.columns, or in whatever other way is convenient. E.g., for your example dataframe you could write:
>>> for i in df.columns[3:]:
...     for a in df.columns[:3]:
...         new_df[i+"-"+a] = df[i] * df[a]
Using loops, you can use this code. It's definitely not the most elegant solution, but it should work for your purpose. It only requires that you specify the columns you'd like to use for the pairwise multiplication. It is quite readable, though, which may be something you want.
def element_wise_mult(first, second):
    result = []
    for i, el in enumerate(first):
        result.append(el * second[i])
    return result

if __name__ == '__main__':
    import pandas as pd
    df = pd.DataFrame({'A': [2, 2, 1, 2],
                       'B': [2, 2, 3, 3],
                       'C': [3, 3, 3, 4],
                       'I': [1, 0, 0, 1],
                       'II': [0, 1, 0, 1]})
    fs = ['I', 'II']
    sc = ['A', 'B', 'C']
    series = []
    names = []
    for i in fs:
        for j in sc:
            names.append(i + '-' + j)
            # each elementwise product is stored as a pandas Series
            series.append(pd.Series(element_wise_mult(df[i], df[j])))
    # reconstruct the dataframe from the stored series and names
    print(pd.DataFrame(series, index=names).T)
Returns:
I-A I-B I-C II-A II-B II-C
0 2 2 3 0 0 0
1 0 0 0 2 2 3
2 0 0 0 0 0 0
3 2 3 4 2 3 4
Here is a solution without for loops for your specific example:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [2, 2, 1, 2],
                   'B': [2, 2, 3, 3],
                   'C': [3, 3, 3, 4],
                   'I': [1, 0, 0, 1],
                   'II': [0, 1, 0, 1]})

cross_vals = np.tile(df[df.columns[:3]].values, (1, 2)) * np.repeat(df[df.columns[3:]].values, 3, axis=1)
cros_cols = np.repeat(df.columns[3:].values, 3) + np.array('-') + np.tile(df.columns[:3].values, (1, 2))
new_df = pd.DataFrame(cross_vals, columns=cros_cols[0])
Then new_df is
I-A I-B I-C II-A II-B II-C
0 2 2 3 0 0 0
1 0 0 0 2 2 3
2 0 0 0 0 0 0
3 2 3 4 2 3 4
You could generalize it to any size, as long as the columns A, B, C, ... are consecutive and, similarly, the columns I, II, ... are consecutive.
For the general case, where the columns are not necessarily consecutive, you can do the following:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [2, 2, 1, 2],
                   'B': [2, 2, 3, 3],
                   'C': [3, 3, 3, 4],
                   'I': [1, 0, 0, 1],
                   'II': [0, 1, 0, 1]})

let = np.array(['A', 'B', 'C'], dtype=object)
num = np.array(['I', 'II'], dtype=object)
cross_vals = np.tile(df[let].values, (1, len(num))) * np.repeat(df[num].values, len(let), axis=1)
cros_cols = np.repeat(num, len(let)) + np.array('-') + np.tile(let, (1, len(num)))
new_df = pd.DataFrame(cross_vals, columns=cros_cols[0])
And the result is the same as above.
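As one more alternative (an addition, not from the answers above), a dict comprehension with pd.concat builds the same frame without explicit column assignment:
import pandas as pd

# each product column is built once; pd.concat assembles them in key order
pairs = {i + '-' + a: df[i] * df[a] for i in ['I', 'II'] for a in ['A', 'B', 'C']}
new_df = pd.concat(pairs, axis=1)
print(new_df)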
