How do I remove header row from a dataframe? [duplicate] - python

0 1 2 3 4 5 6 7 8 9 event
0 False True False True True False True True True False 1
1 False True False True True True True True False False 2
2 True True False False False True False False True False 0
3 True True True True False False True True True False 2
4 True False True False False False True True True True 0
5 True False True False False False True False True True 1
I have to remove the header row (0 to event).
I expect something like this:
0 False True False True True False True True True False
1 False True False True True True True True False False
2 True True False False False True False False True False
3 True True True True False False True True True False
4 True False True False False False True True True True
5 True False True False False False True False True True

It makes no sense to delete column names: how would you refer to the columns? But you can DISPLAY the dataframe without column names:
print(df.to_string(header=False))
to_string actually produces a formatted string which can be re-formatted if required.
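A minimal sketch of the idea, assuming a small made-up frame in place of the asker's data. Since the expected output in the question also omits the event column, the sketch drops it before rendering; that step is an assumption about what the asker wants.

```python
import pandas as pd

# Hypothetical stand-in for the asker's dataframe.
df = pd.DataFrame({0: [False, True], 1: [True, True], "event": [1, 0]})

# Drop the column that should not be shown, then render without the header row.
# The dataframe itself keeps its column names; only the printed text omits them.
text = df.drop(columns="event").to_string(header=False)
print(text)
```

Because to_string returns an ordinary str, the result can be split into lines or post-processed like any other text.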


Same sentence different vectors with sent2vec vectorizer

Hi, I'm trying some code where I have sample data stored in MongoDB.
I query the data and append it to a list. I then have a sentence for which I need to compute a vector and find the cosine similarity. The issue is that when I compute vectors for the same sentence, the vector computed from the MongoDB data differs from the one computed from the test data.
from scipy import spatial
from sent2vec.vectorizer import Vectorizer
from pymongo import MongoClient
# Connecting to local mongo
client = MongoClient('localhost', 27017)
database = client['test']
collection = database['test']
cursor = collection.find({})
#empty list to append the data from mongo
info = []
#test data
info1 = ['hello im john',]
info2 = ['hello im john',]
for document in cursor:
    info.append(str(document["title"]))
    # break
print(info) #check the data
print(info1)
vectorizer = Vectorizer()
vectorizer.run(info)
vectors_info = vectorizer.vectors #mongo data vectors
vectorizer = Vectorizer()
vectorizer.run(info1)
vectors_info1 = vectorizer.vectors #test data vectors
#comparing the vectors
print(vectors_info[0]==vectors_info1[0])
vectorizer = Vectorizer()
vectorizer.run(info2)
vectors_info2 = vectorizer.vectors
#comparing same text again
print(vectors_info1[0]==vectors_info2[0])
The output is as follows:
python3 test.py
['hello im john', 'hello im tez']
['hello im john']
Initializing Bert distilbert-base-uncased
Vectorization done on cpu
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Initializing Bert distilbert-base-uncased
Vectorization done on cpu
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False]
Initializing Bert distilbert-base-uncased
Vectorization done on cpu
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[ True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True]
For print(vectors_info[0]==vectors_info1[0]) all values are False, whereas for print(vectors_info1[0]==vectors_info2[0]) they are all True.
I tried checking whether the two strings are the same with info[0]==info[1], and it returns True.
It appears that vectors_info should be the results of running the Vectorizer on a list with 2 string items:
['hello im john', 'hello im tez']
It appears that vectors_info1 and vectors_info2 should be the results of running the Vectorizer on a list with 1 string item:
['hello im john']
Thus I'd expect to get one vector reported back from Vectorizer for the 1st 2-item list, and some other vector for the two times Vectorizer is asked about the 2nd 1-item list.
That seems to be exactly what your output shows. To get the same vector back, you would need to pass the same input.
(You may also want to double-check whether sent2vec.Vectorizer expects a string, or a list-of-words, or something else – and that you are passing it the form it needs.)
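Since the stated goal is cosine similarity, an exact elementwise == on float vectors may be a stricter test than needed. The sketch below uses made-up vectors (not real sent2vec output) to show a tolerance-based comparison and a cosine similarity in NumPy. Whether the mismatch in the question is small numerical drift or a substantively different embedding cannot be determined from the output shown, so treat this only as a way to check.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical vectors standing in for two embeddings of the same sentence.
v1 = np.array([0.10, 0.20, 0.30])
v2 = v1 + 1e-7  # assumed tiny numerical difference, for illustration only

print(np.array_equal(v1, v2))          # exact comparison
print(np.allclose(v1, v2, atol=1e-5))  # comparison within a tolerance
print(cosine_similarity(v1, v2))       # similarity score
```

If the similarity between the two embeddings of 'hello im john' is close to 1, the difference is probably harmless for a similarity search; if it is not, the input form passed to the Vectorizer is worth checking first.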

How to count the longest run of consecutive True values?

Each row contains True/False values. How can I count the runs of consecutive True values and select the longest one?
example:
True,True,True,True,False,True,True,True,True,True,True,False,True
answer:
4,6,1 ----> the answer is 6!
Well I have a data frame and I have to do that for each row:
diff0 diff1 diff2 diff3 diff4 diff5 diff6 diff7 diff8 diff9
0 True True True False True False False True False True
1 False False False False False False False False False False
2 False False False False False False False False False False
3 True False False False True False True False False False
4 True False True False False True False False False False
For example, the answer for the first row is 3.
Try:
>>> from itertools import groupby
>>> lst = [True,True,True,True,False,True,True,True,True,True,True,False,True]
>>> max(sum(1 for i in c if i) for n, c in groupby(lst))
6
Edit: To implement this for each row of your DataFrame, you can do:
df["seq"] = df.apply(lambda x: max(sum(1 for i in c if i) for n, c in groupby(x)), axis=1)
>>> df
diff0 diff1 diff2 diff3 diff4 diff5 diff6 diff7 diff8 diff9 seq
0 True True True False True False False True False True 3
1 False False False False False False False False False False 0
2 False False False False False False False False False False 0
3 True False False False True False True False False False 1
4 True False True False False True False False False False 1
Use groupby() on a key that increments wherever the value in the sequence changes,
get the length of each run using count(),
then take the max() to get the longest run.
import pandas as pd
df = pd.DataFrame({"seq":[True,True,True,True,False,True,True,True,True,True,True,False,True]})
df.groupby((df["seq"]!=df["seq"].shift()).cumsum()).count().max()
output
seq 6
dtype: int64
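To do this row by row instead of on a single column, transpose the dataframe and process each column of the transpose.
The aggregation function is changed from count() to sum() so that only True values are counted.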
import io
df = pd.read_csv(io.StringIO(""" diff0 diff1 diff2 diff3 diff4 diff5 diff6 diff7 diff8 diff9
0 True True True False True False False True False True
1 False False False False False False False False False False
2 False False False False False False False False False False
3 True False False False True False True False False False
4 True False True False False True False False False False"""), sep=r"\s+")
dft = df.T
df["maxslen"] = [dft.groupby((dft[c]!=dft[c].shift()).cumsum()).sum().max()[c] for c in dft.columns]
Output:
diff0 diff1 diff2 diff3 diff4 diff5 diff6 diff7 diff8 diff9 maxslen
True True True False True False False True False True 3
False False False False False False False False False False 0
False False False False False False False False False False 0
True False False False True False True False False False 1
True False True False False True False False False False 1
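You can use groupby from itertools. Passing default=0 keeps max() from raising a ValueError on a row that contains no True values:
from itertools import groupby
max((sum(1 for v in vals) for k, vals in groupby(bools) if k), default=0)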
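As a self-contained check, the itertools approach can be wrapped in a helper and applied to each row of the dataframe from the question. The helper name longest_true_run is made up for this sketch; the data is copied from the question.

```python
from itertools import groupby

import pandas as pd

def longest_true_run(values):
    """Length of the longest run of consecutive truthy values (0 if none)."""
    runs = (sum(1 for _ in run) for is_true, run in groupby(values) if is_true)
    return max(runs, default=0)

rows = [
    [True, True, True, False, True, False, False, True, False, True],
    [False] * 10,
    [False] * 10,
    [True, False, False, False, True, False, True, False, False, False],
    [True, False, True, False, False, True, False, False, False, False],
]
df = pd.DataFrame(rows, columns=[f"diff{i}" for i in range(10)])

# axis=1 passes each row to the helper as a Series.
longest = df.apply(longest_true_run, axis=1)
print(longest.tolist())  # [3, 0, 0, 1, 1]
```

The all-False rows yield 0 because max() is given a default for the case where no True run exists.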

How to change the first occurrence of 'True' in a row to false in pandas

I'm trying to change the first True in each row of my DataFrame to False, i.e. turn the first table below into the second:
A B C
Number
1 True True True
2 False True True
3 False False True
A B C
Number
1 False True True
2 False False True
3 False False False
Every time I try using for index, row in target_df.iterrows():, it never finds any True when I look through the row.
Thanks in advance!
You can use the cumulative sum of the Boolean values (False corresponds to 0; True to 1) for each row, along with DataFrame.mask():
>>> condition = df.cumsum(axis=1) == 1
>>> df.mask(condition, False)
a b c
0 False True True
1 False False True
2 False False False
df.mask(self, cond, other=nan)
Return an object of same shape as self and whose corresponding entries
are from self where cond is False and otherwise are from other.
In this case, condition is False everywhere except the points at which you want to switch True -> False:
>>> condition
a b c
0 True False False
1 False True False
2 False False True
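One other option would be to use NumPy. Note that this relies on df.values returning a view of the underlying data, which pandas does not guarantee:
>>> import numpy as np
>>> row, col = np.where(df.cumsum(axis=1) == 1)
>>> df.values[row, col] = False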
>>> df
a b c
0 False True True
1 False False True
2 False False False
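The mask approach can be tried end to end with the sketch below, which rebuilds the frame from the question. Two details are additions for this sketch rather than part of the answer: the explicit astype(int) before the cumulative sum, and the extra `& df`, which restricts the condition to cells that are actually True.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [True, False, False], "B": [True, True, False], "C": [True, True, True]},
    index=pd.Index([1, 2, 3], name="Number"),
)

# Running count of True values along each row; it equals 1 from the first True
# up to (but not including) the second True. AND-ing with df keeps only the
# first True itself.
first_true = df.astype(int).cumsum(axis=1).eq(1) & df

# mask() replaces the flagged cells and returns a new frame; df is unchanged.
result = df.mask(first_true, False)
print(result)
```

Because mask() returns a new object, assign the result back (df = df.mask(...)) if the change should persist.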

pandas+xlsx: format cells based on another dataframe

I have a pivot table from a dataframe:
pv=testdata.pivot(index='dose',columns='el_num',values='value').reindex(index=doseann)
el_num 1 2 3 4 5 6 7 8 9 10 11
dose
100.0 7.07460 6.37422 19.8883 18.6835 16.5359 59.8294 28.5587 14.18910 39.5265 4.33896 38.0297
11931.0 6.41105 8.27059 19.0014 18.6988 16.4000 59.1123 29.4836 13.25030 36.2842 5.89428 37.9752
25079.0 6.82894 8.11478 19.8956 18.8933 15.8732 58.6548 29.8440 13.25930 36.7238 7.37476 39.1368
49640.0 7.20882 8.17981 19.3958 18.0241 15.3036 58.6676 29.9847 12.50980 37.5594 7.81891 38.7749
71545.0 9.57559 11.55590 15.4280 15.8461 13.5970 59.9049 27.4346 8.38379 40.9102 7.78858 38.5024
84303.0 9.69782 11.00110 16.4352 14.9416 13.6581 59.9323 26.3975 9.74285 40.3733 7.85947 38.5113
101415.0 10.60720 10.36910 16.3399 16.9584 13.1570 60.1249 27.9201 11.02400 39.6205 7.64924 39.0897
150913.0 10.70750 10.07470 17.9623 16.1063 13.2890 59.9274 27.7685 11.94690 39.0937 8.43550 39.5281
169885.0 10.39460 0.00000 16.9633 14.7942 13.8830 58.9495 27.9250 12.58740 38.8587 8.10606 38.8391
200463.0 9.59026 9.26161 18.0652 15.2096 13.0975 59.1136 27.8377 11.90810 40.4693 8.51281 39.2943
24.0 9.45291 9.27879 17.9021 16.5391 13.4601 58.9314 27.3388 10.94170 39.0885 8.77127 38.4680
192.0 6.14907 6.94374 19.6765 12.5670 15.6754 56.5163 28.8796 11.78300 36.6076 6.21283 38.8232
Here is another pivot table with logical values:
fl=testdata.pivot(index='dose',columns='el_num',values='fail').reindex(index=doseann)
el_num 1 2 3 4 5 6 7 8 9 10 11
dose
100.0 False False False False False True False False True False True
11931.0 False False False False False True False False True False True
25079.0 False False False False False True False False True False True
49640.0 False False False False False True False False True False True
71545.0 False False False False False True False False True False True
84303.0 False False False False False True False False True False True
101415.0 False False False False False True False False True False True
150913.0 False False False False False True False False True False True
169885.0 False False False False False True False False True False True
200463.0 False False False False False True False False True False True
24.0 False False False False False True False False True False True
192.0 False False False False False True False False True False True
It is stored to Excel:
doc=pd.ExcelWriter('tests.xlsx',engine='xlsxwriter')
pv2=pd.DataFrame(pv)
pv2.to_excel(doc,sheet_name='Sheet1')
I need to write it to an xlsx file and set the cell color according to the second pivot table, i.e. make a cell 75% gray if the corresponding value in fl is True. How can I do it?
If you store both dataframes in the Excel workbook you can use conditional formatting to highlight the cells in one region based on the values in another region. See also Adding Conditional Formatting to Dataframe output.
If you only want to add the values dataframe then I would suggest not using pd.ExcelWriter() directly and using XlsxWriter formatting.
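One way to sketch the XlsxWriter route is to let pandas write the values and then rewrite the flagged cells with a format. The helper names failed_cells and write_highlighted are hypothetical, the hex color is an assumed stand-in for "75% gray", and the +1 offsets assume to_excel wrote a single header row and a single index column.

```python
import pandas as pd

def failed_cells(fl):
    """Worksheet (row, col) positions of the True cells in fl.

    The +1 offsets skip the header row and the index column that
    to_excel() writes (assumes single-level columns and index).
    """
    return [
        (r + 1, c + 1)
        for r in range(fl.shape[0])
        for c in range(fl.shape[1])
        if bool(fl.iat[r, c])
    ]

def write_highlighted(pv, fl, path="tests.xlsx"):
    """Write pv to an xlsx file and shade the cells where fl is True."""
    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        pv.to_excel(writer, sheet_name="Sheet1")
        workbook = writer.book
        worksheet = writer.sheets["Sheet1"]
        gray = workbook.add_format({"bg_color": "#404040"})  # assumed shade
        for r, c in failed_cells(fl):
            # Rewrite the value together with the cell format.
            worksheet.write(r, c, float(pv.iat[r - 1, c - 1]), gray)
```

With the frames from the question this would be called as write_highlighted(pv, fl). It requires the xlsxwriter package and assumes pv and fl have the same shape and ordering, which holds when both are pivoted and reindexed the same way.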

Pandas match multi-column patterns

I have a dataframe consisting of boolean values. I'd like to match certain multi-column patterns in the dataframe. The pattern would look like:
bar foo
0 False True
1 True False
And the expected output would look like:
foo bar pattern
0 True False False
1 True False False
2 True False True
3 False True False
4 False True False
5 False True False
6 False False False
7 False False False
8 False False False
9 False True False
10 False True False
11 False True False
12 False True False
13 False True False
14 False True False
15 False True False
16 True False False
17 True False False
18 True False True
19 False True False
20 False True False
21 False True False
22 True False True
23 False True False
24 False True False
25 False True False
I came up with my own implementation, but I guess there should be a better one.
from functools import partial
import pandas as pd

def matcher(df, pattern):
    def aggregator(pattern):
        """Returns a dict of columns with their aggregator function,
        which is the partially applied inner in this case"""
        def inner(col, window):
            return (window == pattern[col]).all()
        return {col: partial(inner, col) for col in pattern.columns}
    aggregated = (df
        # Feed the chunks to aggregator in `len(pattern)` sized windows
        .rolling(len(pattern))
        .aggregate(aggregator(pattern))
        # I'd like it to return True at the beginning of the match
        .shift(-len(pattern) + 1)
        # rows consisting of nan return true to `.all()`
        .fillna(False))
    ret = [row.all() for _, row in aggregated.iterrows()]
    return pd.Series(ret)
My biggest concern is handling nan values, and the lack of wildcard support (in order to support not necessarily box-shaped patterns).
Any suggestions?
If pd.concat() is not too expensive for you, the code below works efficiently because it has no loop and no nested function.
print(df) # Original data without 'pattern' column.
df_wide = pd.concat([df, df.shift(-1)], axis=1)
df_wide.columns = ['foo0', 'bar0', 'foo-1', 'bar-1']
pat = ((df_wide['foo0'] == True) & (df_wide['bar-1'] == True)) & \
((df_wide['bar0'] == False) & (df_wide['foo-1'] == False))
df['pattern'] = False
df.loc[df_wide[pat].index, 'pattern'] = True
print(df) # Result data with 'pattern' column.
# Original data without 'pattern' column.
foo bar
0 True False
1 True False
2 True False
3 False True
4 False True
5 False True
...
# Result data with 'pattern' column.
foo bar pattern
0 True False False
1 True False False
2 True False True
3 False True False
4 False True False
5 False True False
6 False False False
7 False False False
8 False False False
9 False True False
10 False True False
11 False True False
12 False True False
13 False True False
14 False True False
15 False True False
16 True False False
17 True False False
18 True False True
19 False True False
20 False True False
21 False True False
22 True False True
23 False True False
24 False True False
25 False True False
Suppose df1 is your pattern df and df2 is your value df. You can use apply to check the pattern: for every row, take the current row and the next row, compare that 2*2 array with df1 elementwise, and check whether all elements are the same.
df2.apply(lambda x: (df2[['foo','bar']].iloc[x.name:x.name+2].values\
==df1[['foo','bar']].values).all(),axis=1)
Out[213]:
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
16 False
17 False
18 True
19 False
20 False
21 False
22 True
23 False
24 False
25 False
dtype: bool
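The two answers can be generalized to a pattern of any height by shifting the frame once per pattern row. The following is a sketch under the assumption that a match is marked at its first row, as in the question; the function name match_pattern is made up here.

```python
import pandas as pd

def match_pattern(df, pattern):
    """Boolean Series that is True where `pattern` starts in `df`."""
    cols = list(pattern.columns)
    hit = pd.Series(True, index=df.index)
    for offset in range(len(pattern)):
        # Row i of `shifted` holds row i + offset of df (NaN past the end).
        shifted = df[cols].shift(-offset)
        expected = pattern.iloc[offset]
        # NaN never equals True or False, so windows running off the end fail.
        hit &= shifted.eq(expected, axis=1).all(axis=1)
    return hit

df = pd.DataFrame({
    "foo": [True, True, True, False, False, False],
    "bar": [False, False, False, True, True, True],
})
pattern = pd.DataFrame({"bar": [False, True], "foo": [True, False]})

df["pattern"] = match_pattern(df[["foo", "bar"]], pattern)
print(df)
```

The loop runs over the rows of the pattern, not over the rows of the data, so the cost per pattern row is a vectorized comparison. Wildcards are not handled in this sketch.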
