Python / pandas: concatenate columns with a loop

I have a list of columns that I need to concatenate. An example table would be:
import random
import numpy as np
import pandas as pd

cats1 = ['T_JW', 'T_BE', 'T_FI', 'T_DE', 'T_AP', 'T_KI', 'T_HE']
data = np.array([random.sample(list(range(0, 2)) * 7, 7)] * 3)
df_ = pd.DataFrame(data, columns=cats1)
So I need to get the concatenation of each row (if possible with a blank space between each value). I tried:
listaFin = [''] * 1000
for i in cats1:
    lista = list(df_[i])
    listaFin = zip(listaFin, lista)
But I get a list of tuples:
listaFin:
[((((((('', 0), 0), 1), 0), 1), 0), 1),
((((((('', 0), 0), 1), 0), 1), 0), 1),
((((((('', 0), 0), 1), 0), 1), 0), 1)]
And I need to get something like
[0 0 1 0 1 0 1,
0 0 1 0 1 0 1,
0 0 1 0 1 0 1]
How can I do this using only one loop or less (I don't want to use a double loop)?
Thanks.

I don't think you can have a list of space-delimited integers in Python without them being in a string (I might be wrong). Having said that, the answer I have is:
output = []
for i in range(df_.shape[0]):
    output.append(' '.join(str(x) for x in list(df_.loc[i])))
print(output)
output looks like this:
['1 0 0 0 1 0 1', '1 0 0 0 1 0 1', '1 0 0 0 1 0 1']
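A loop-free alternative: cast everything to strings and join row-wise with agg. A minimal sketch against the df_ built above:
# convert every cell to str, then join each row's values with a space
output = df_.astype(str).agg(' '.join, axis=1).tolist()
print(output)  # e.g. ['1 0 0 0 1 0 1', '1 0 0 0 1 0 1', '1 0 0 0 1 0 1']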


How to make a graph from a table in Python?

If I have code like this:
import pandas as pd
import random

x = 5
table = []
row = []
for i in range(x):
    for j in range(x):
        if i == j:
            row.append(0)
        else:
            row.append(random.randint(0, 1))
    table.append(row)
    row = []
df = pd.DataFrame(table)
df
the output will be a random 5x5 DataFrame of 0s and 1s with zeros on the diagonal.
How can I make a graph from this table?
I want the output graph (as an edge list) to look like this: [(0,1), (0,2), (1,0), (1,2), (1,4), (2,3), (2,4), (3,0), (3,1), (3,2), (3,4), (4,0)]
IIUC, replace 0 with NA, stack (which drops NA by default), and convert the index to a list:
df.replace(0, pd.NA).stack().index.to_list()
output:
[(0, 3), (0, 4), (1, 0), (1, 2), (1, 3), (2, 0), (2, 1), (4, 0), (4, 3)]
matching input:
0 1 2 3 4
0 0 0 0 1 1
1 1 0 1 1 0
2 1 1 0 0 0
3 0 0 0 0 0
4 1 0 0 1 0
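If the goal is an actual graph object rather than just the edge list, the list above can be fed straight into networkx (assuming the third-party networkx package is available); a minimal sketch:
import networkx as nx
edges = df.replace(0, pd.NA).stack().index.to_list()
G = nx.DiGraph(edges)  # directed, since the adjacency matrix isn't symmetric
print(G.edges())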

Get log(n, 2) features after dummy encoding

During dummy encoding (for example with OneHotEncoder) we can drop the first feature (with the param drop='first'). It helps when we have 3 categories: 2 cells are enough to encode 3 categories, like (0, 0), (0, 1), (1, 0); likewise for 4 categories: (0, 0), (0, 1), (1, 0), (1, 1). So I noticed that to encode n categories it's enough to have math.ceil(log(n, 2)) features. But I can't find a function (in sklearn/pandas) that does this. I'd appreciate your help.
What you're looking for is the bin() function, which is a standard built-in in Python.
Suppose you happen to have a simple pandas df:
df = pd.DataFrame({"a":["a","b","c","x","a","c"]})
print(df)
a
0 a
1 b
2 c
3 x
4 a
5 c
Then you may proceed as follows:
# integer-encode the categories (a -> 0, b -> 1, c -> 2, x -> 3)
df["enc"] = df["a"].astype("category").cat.codes
# number of binary digits needed for the largest code
max_enc_length = len(bin(df["enc"].max())[2:])
# zero-padded binary string for each code
df["enc"] = df["enc"].apply(lambda x: bin(x)[2:].zfill(max_enc_length))
# split each binary string into one column per bit
df = pd.concat([df["a"], df["enc"].apply(lambda x: pd.Series(list(x)))], axis=1)
print(df)
a 0 1
0 a 0 0
1 b 0 1
2 c 1 0
3 x 1 1
4 a 0 0
5 c 1 0
Note that linear models are a poor fit for this type of encoding.
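If you'd rather not hand-roll this, the third-party category_encoders package (scikit-learn-contrib) ships a BinaryEncoder that implements the same idea; a minimal sketch, assuming the package is installed:
import pandas as pd
from category_encoders import BinaryEncoder

df = pd.DataFrame({"a": ["a", "b", "c", "x", "a", "c"]})
# integer-encodes the column, then splits the codes into bit columns
encoded = BinaryEncoder(cols=["a"]).fit_transform(df)
print(encoded)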

How does one interpret sklearn sparse matrix outputs?

I am trying to produce a bigram word co-occurrence matrix, indicating how many times one word follows another in a corpus.
As a test, I wrote the following (which I gathered from other SE questions):
from sklearn.feature_extraction.text import CountVectorizer
test_sent = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']
bigram_vec = CountVectorizer(ngram_range=(1,2))
X = bigram_vec.fit_transform(test_sent)
Xc = (X.T * X)
print(Xc)
I expected this to give the correct output. The matrix Xc prints like so:
(0, 0) 1
(1, 1) 2
(2, 2) 2
(3, 3) 1
(4, 4) 1
I have no idea how to interpret this. I attempted to make it dense to help with my interpretation using Xc.todense(), which gave this:
[[1 0 0 0 0]
[0 2 0 0 0]
[0 0 2 0 0]
[0 0 0 1 0]
[0 0 0 0 1]]
Neither of these gives the correct word co-occurrence matrix, i.e. one showing how many times the row word follows the column word.
Could someone please explain how I can interpret/use this output? Why does it look like that?
Addition to question
Here is another possible output with a different example using ngram_range=(2,2):
from sklearn.feature_extraction.text import CountVectorizer
test_sent = ['hello biggest awesome biggest biggest awesome today lively splendid awesome today']
bigram_vec = CountVectorizer(ngram_range=(2,2))
X = bigram_vec.fit_transform(test_sent)
print(bigram_vec.get_feature_names())
Xc = (X.T * X)
print(Xc)
print(' ')
print(Xc.todense())
(4, 0) 1
(2, 0) 2
(0, 0) 1
(3, 0) 1
(1, 0) 2
(7, 0) 1
(5, 0) 1
(6, 0) 1
(4, 1) 2
(2, 1) 4
(0, 1) 2
(3, 1) 2
(1, 1) 4
(7, 1) 2
(5, 1) 2
(6, 1) 2
(4, 2) 2
(2, 2) 4
(0, 2) 2
(3, 2) 2
(1, 2) 4
(7, 2) 2
(5, 2) 2
(6, 2) 2
(4, 3) 1
: :
(6, 4) 1
(4, 5) 1
(2, 5) 2
(0, 5) 1
(3, 5) 1
(1, 5) 2
(7, 5) 1
(5, 5) 1
(6, 5) 1
(4, 6) 1
(2, 6) 2
(0, 6) 1
(3, 6) 1
(1, 6) 2
(7, 6) 1
(5, 6) 1
(6, 6) 1
(4, 7) 1
(2, 7) 2
(0, 7) 1
(3, 7) 1
(1, 7) 2
(7, 7) 1
(5, 7) 1
(6, 7) 1
[[1 2 2 1 1 1 1 1]
[2 4 4 2 2 2 2 2]
[2 4 4 2 2 2 2 2]
[1 2 2 1 1 1 1 1]
[1 2 2 1 1 1 1 1]
[1 2 2 1 1 1 1 1]
[1 2 2 1 1 1 1 1]
[1 2 2 1 1 1 1 1]]
This one seems to tokenize by bigrams, since calling bigram_vec.get_feature_names() gives
[u'awesome biggest', u'awesome today', u'biggest awesome', u'biggest biggest', u'hello biggest', u'lively splendid', u'splendid awesome', u'today lively']
Some help interpreting this would be great. It's a symmetric matrix, so I'm thinking it might just be the number of co-occurrences?
First you need to check the feature names that the CountVectorizer is using:
bigram_vec.get_feature_names()
# Out: [u'am', u'dont', u'hello', u'to', u'want']
You see that the word "i" is not present. That's because the default tokenizer uses a pattern:
token_pattern : string
Regular expression denoting what constitutes a “token”, only used if
analyzer == 'word'. The default regexp select tokens of 2 or more
alphanumeric characters (punctuation is completely ignored and always
treated as a token separator).
And the actual output X should be interpreted as:
[u'am', u'dont', u'hello', u'to', u'want']
'hello' [[ 0 0 1 0 0]
'i' [ 0 0 0 0 0]
'am' [ 1 0 0 0 0]
'hello' [ 0 0 1 0 0]
'i' [ 0 0 0 0 0]
'dont' [ 0 1 0 0 0]
'want' [ 0 0 0 0 1]
'to' [ 0 0 0 1 0]
'i' [ 0 0 0 0 0]
'dont' [ 0 1 0 0 0]]
Now when you do X.T * X this should be interpreted as:
u'am' u'dont' u'hello' u'to' u'want'
u'am' [[1 0 0 0 0]
u'dont' [0 2 0 0 0]
u'hello' [0 0 2 0 0]
u'to' [0 0 0 1 0]
u'want' [0 0 0 0 1]]
If you are expecting anything else, then you should add the details in the question.
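As for the bigram counts the question is actually after: X.T * X only measures co-occurrence within the same document (and here each document is a single token), so it cannot capture word order. A direct way to count how often one word follows another is a plain Counter over consecutive token pairs; a minimal sketch, independent of sklearn:
from collections import Counter
import pandas as pd

tokens = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']
# count each (word, next word) pair
pairs = Counter(zip(tokens, tokens[1:]))
vocab = sorted(set(tokens))
cooc = pd.DataFrame(0, index=vocab, columns=vocab)
for (w1, w2), n in pairs.items():
    cooc.loc[w1, w2] = n  # row = first word, column = the word that follows
print(cooc)
(And if you only wanted CountVectorizer to keep one-letter words like 'i', you can pass token_pattern=r"(?u)\b\w+\b" to override the default two-character minimum.)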

Pandas: Conditionally Selecting Columns to perform Calculation based on Header of another Column

My dataframe looks like this:
(1, 2) (1, 3) (1, 4) (1, 5) (1, 6) (1, 7) (1, 8) (1, 9) (1, 10) (1, 11) ... 2 3 4 5 6 7 8 9 10 11
0 0 1 0 1 1 1 1 0 1 0 ... 0.612544 0.727393 0.366578 0.631451 0.722980 0.772853 0.964982 0.549801 0.406692 0.798083
1 0 0 0 0 0 0 0 0 0 0 ... 0.583228 0.698729 0.343934 0.602037 0.694230 0.745422 0.954682 0.521298 0.382381 0.771640
2 1 0 0 1 0 1 1 0 0 0 ... 0.481291 0.593353 0.271028 0.498949 0.588807 0.641602 0.901779 0.424495 0.303309 0.669657
3 1 1 0 1 0 1 1 0 0 1 ... 0.583228 0.698729 0.343934 0.602037 0.694230 0.745422 0.954682 0.521298 0.382381 0.771640
4 0 0 0 1 1 1 1 1 1 1 ... 0.612544 0.727393 0.366578 0.631451 0.722980 0.772853 0.964982 0.549801 0.406692 0.798083
where I have column headers that are tuples, like (1, 2), and column headers that are a single element, like 1. I want to perform a calculation on the tuple columns based on the columns named by the elements of that tuple. For example, for the tuple column (1, 2), I want to retrieve columns 1 and 2, multiply them together, then subtract the result from column (1, 2).
The solution I thought of was to create 55 new columns holding the products of the single-element columns (e.g. 1 * 2), and then do some sort of identity match using .where() and all(). However, this seems computationally inefficient, since I'd be materializing a whole other set of data rather than performing the calculation directly on the tuple columns. How would I go about this?
Not sure if this is faster, but here's a solution without needing where()/all():
import pandas as pd
# create some sample data
arr = [[1, 2, 3, 4, 5, 6, 7],
[7, 6, 5, 4, 3, 2, 1]]
df = pd.DataFrame(arr, columns=[('a', 'b'), ('c','d'), ('a', 'd'), 'a', 'b', 'c', 'd'])
# get all tuple headers
tuple_columns = [col for col in df.columns if isinstance(col, tuple)]
# put the results into a list of series and concat into a DataFrame
results = pd.concat([df[col] - df[col[0]] * df[col[1]] for col in tuple_columns], axis=1)
# rename the columns
results.columns = tuple_columns
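For the sample data above, the result works out by hand (first row: ('a', 'b') gives 1 - 4*5 = -19, ('c', 'd') gives 2 - 6*7 = -40, ('a', 'd') gives 3 - 4*7 = -25), so printing it should give roughly:
print(results)
#    (a, b)  (c, d)  (a, d)
# 0     -19     -40     -25
# 1      -5       4       1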

Slice pandas series with elements not in the index

I have a pandas series indexed by tuples, like this:
from pandas import Series
s = Series({(0, 0): 1, (0, 1): 2, (0, 3): 3, (1, 0): 1, (1, 2): 4, (3, 0): 5})
I want to slice such a series by using indexes that are also tuples (using lexicographic ordering), but not necessarily in the index. Slicing seems to work when I pass an index that is on the series:
s[:(1,0)]
(0, 0) 1
(0, 1) 2
(0, 3) 3
(1, 0) 1
dtype: int64
but if I try slicing by an index which is not on the series there is an error:
s[:(1,1)]
...
ValueError: Index(...) must be called with a collection of some kind, 0 was passed
Ideally I'd like to get the series elements indexed by (0, 0), (0, 1), (0, 3), (1, 0), similar to what happens when slicing using dates in TimeSeries. Is there a simple way to achieve this?
This works if you have a MultiIndex rather than an index of tuples:
In [11]: s.index = pd.MultiIndex.from_tuples(s.index)
In [12]: s
Out[12]:
0 0 1
1 2
3 3
1 0 1
2 4
3 0 5
dtype: int64
In [13]: s[:(1,1)]
Out[13]:
0 0 1
1 2
3 3
1 0 1
dtype: int64
In a previous edit I had suggested this could be a bug, and had created an awful hack...
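One caveat: slicing a MultiIndex with a key that isn't present requires the index to be lexsorted, otherwise recent pandas raises an UnsortedIndexError. The series above happens to come out sorted, but in general it's safest to sort first; a minimal sketch:
import pandas as pd

s = pd.Series({(0, 0): 1, (0, 1): 2, (0, 3): 3, (1, 0): 1, (1, 2): 4, (3, 0): 5})
s.index = pd.MultiIndex.from_tuples(s.index)
s = s.sort_index()       # required for slicing with keys not in the index
print(s.loc[:(1, 1)])    # everything up to (1, 1) in lexicographic order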
