I have come across the concept of flags in Python on a few occasions, for example in wxPython. One example is the initialization of a frame object; note the attributes passed to style:
frame = wx.Frame(None, style=wx.MAXIMIZE_BOX | wx.RESIZE_BORDER | wx.SYSTEM_MENU | wx.CAPTION | wx.CLOSE_BOX)
I don't really understand the concept of flags, and I haven't found a solid explanation of what exactly the term "flag" means in Python. How are all these attributes passed to one variable?
The only thing I can think of is that the "|" character is used as a boolean operator, but in that case wouldn't all the attributes passed to style just evaluate to a single boolean value?
What is usually meant by flags in this sense is bits in a single integer value, where | is the usual bitwise-OR operator.
Say wx.MAXIMIZE_BOX = 8 and wx.RESIZE_BORDER = 4; if you OR them together you get 12. Because each flag occupies a distinct bit, in this case you could even use the + operator instead of | (though | is safer, since it also works when a flag is already set).
Try printing the constants print(wx.MAXIMIZE_BOX) etc. and you may get a better understanding.
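A minimal sketch of the idea, using made-up flag values rather than the real wx constants:

```python
# Hypothetical flag values -- each is a distinct power of two,
# so each occupies its own bit in the integer.
MAXIMIZE_BOX  = 8   # 0b1000
RESIZE_BORDER = 4   # 0b0100

style = MAXIMIZE_BOX | RESIZE_BORDER   # combine flags with bitwise OR
print(style)                           # → 12
print(bin(style))                      # → 0b1100

# The receiver can test each flag individually with bitwise AND:
print(bool(style & MAXIMIZE_BOX))      # → True
print(bool(style & 2))                 # → False (bit 1 was never set)
```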
Flags are not unique to Python; they are a concept used in many languages. They build on the concepts of bits and bytes: computer memory stores information using, essentially, a huge number of flags. Those flags are bits; they are either off (value 0) or on (value 1), even though you usually access computer memory in groups of at least 8 such flags (bytes and, for larger groups, words that are a multiple of 8 bits, specific to the computer architecture).
Integer numbers are an easy and common representation of the information stored in bytes; a single byte can store any integer number between 0 and 255, and with more bytes you can represent bigger integers. But those integers still consist of bits that are either on or off, and so you can use those as switches to enable or disable features. You pass in specific integer values with specific bits enabled or disabled to switch features on and off.
So a byte consists of 8 flags (bits); enabling exactly one of them gives you 8 different integers: 1, 2, 4, 8, 16, 32, 64 and 128, and you can pass a combination of those numbers to a library like wxPython to set different options. For multi-byte integers, the numbers simply continue doubling.
But you a) don't want to remember what each number means, and b) need a method of combining them into a single integer number to pass on.
The | operator does the latter, and the wx.MAXIMIZE_BOX, wx.RESIZE_BORDER, etc names are just symbolic constants for the integer values, set by the wxWidget project in various C header files, and summarised in wx/toplevel.h and wx/defs.h:
/*
Summary of the bits used (some of them are defined in wx/frame.h and
wx/dialog.h and not here):
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|15|14|13|12|11|10| 9| 8| 7| 6| 5| 4| 3| 2| 1| 0|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | \_ wxCENTRE
| | | | | | | | | | | | | | \____ wxFRAME_NO_TASKBAR
| | | | | | | | | | | | | \_______ wxFRAME_TOOL_WINDOW
| | | | | | | | | | | | \__________ wxFRAME_FLOAT_ON_PARENT
| | | | | | | | | | | \_____________ wxFRAME_SHAPED
| | | | | | | | | | \________________ wxDIALOG_NO_PARENT
| | | | | | | | | \___________________ wxRESIZE_BORDER
| | | | | | | | \______________________ wxTINY_CAPTION_VERT
| | | | | | | \_________________________
| | | | | | \____________________________ wxMAXIMIZE_BOX
| | | | | \_______________________________ wxMINIMIZE_BOX
| | | | \__________________________________ wxSYSTEM_MENU
| | | \_____________________________________ wxCLOSE_BOX
| | \________________________________________ wxMAXIMIZE
| \___________________________________________ wxMINIMIZE
\______________________________________________ wxSTAY_ON_TOP
...
*/
and
/*
Summary of the bits used by various styles.
High word, containing styles which can be used with many windows:
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|31|30|29|28|27|26|25|24|23|22|21|20|19|18|17|16|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | \_ wxFULL_REPAINT_ON_RESIZE
| | | | | | | | | | | | | | \____ wxPOPUP_WINDOW
| | | | | | | | | | | | | \_______ wxWANTS_CHARS
| | | | | | | | | | | | \__________ wxTAB_TRAVERSAL
| | | | | | | | | | | \_____________ wxTRANSPARENT_WINDOW
| | | | | | | | | | \________________ wxBORDER_NONE
| | | | | | | | | \___________________ wxCLIP_CHILDREN
| | | | | | | | \______________________ wxALWAYS_SHOW_SB
| | | | | | | \_________________________ wxBORDER_STATIC
| | | | | | \____________________________ wxBORDER_SIMPLE
| | | | | \_______________________________ wxBORDER_RAISED
| | | | \__________________________________ wxBORDER_SUNKEN
| | | \_____________________________________ wxBORDER_{DOUBLE,THEME}
| | \________________________________________ wxCAPTION/wxCLIP_SIBLINGS
| \___________________________________________ wxHSCROLL
\______________________________________________ wxVSCROLL
...
*/
The | operator is the bitwise OR operator; it combines the bits of two integers, each matching bit is paired up and turned into an output bit according to the boolean rules for OR. When you do this for those integer constants, you get a new integer number with multiple flags enabled.
So the expression
wx.MAXIMIZE_BOX | wx.RESIZE_BORDER | wx.SYSTEM_MENU | wx.CAPTION | wx.CLOSE_BOX
gives you an integer number with bit numbers 9, 6, 11, 29, and 12 set. Below I used '0' and '1' strings to represent the bits (most significant bit first, so bit n sits at string index 31 - n) and int(..., 2) to interpret the sequence as a single integer in binary notation:
>>> fourbytes = ['0'] * 32
>>> for bit in (9, 6, 11, 29, 12):
...     fourbytes[31 - bit] = '1'
...
>>> ''.join(fourbytes)
'00100000000000000001101001000000'
>>> int(''.join(fourbytes), 2)
536877632
On the receiving end, you can use the & bitwise AND operator to test whether a specific flag is set; this returns 0 if the flag is not set, or the integer assigned to the flag constant if it is. In both C and Python, a non-zero number is true in a boolean test, so testing for a specific flag is usually done with:
if ( style & wxMAXIMIZE_BOX ) {
for determining that a specific flag is set, or
if ( !(style & wxBORDER_NONE) )
to test for the opposite.
It is a boolean operator - not a logical one, but a bitwise one. wx.MAXIMIZE_BOX and the rest are typically integers that are powers of two - 1, 2, 4, 8, 16... - which means exactly one bit in each of them is 1 and all the rest are 0. When you apply bitwise OR (x | y) to such integers, the net effect is that they combine: 2 | 8 (0b00000010 | 0b00001000) becomes 10 (0b00001010). They can be pried apart later using the bitwise AND (x & y) operator, also called a masking operator: 10 & 8 > 0 is true because the bit corresponding to 8 is turned on.
Related
I have a data frame with more than 1,000,000 rows and 15 columns.
I have to create new columns and assign values to them based on the string values in the other columns, matching them either with a regex or with an exact character match.
For example, there is a column called File path. I have to create a feature column whose values come from matching a folder path (full or partial) against the file path.
I thought about iterating with a for loop, but that is very time-consuming, and with pandas I think iteration would take even more time if the number of components to loop over grows in the future.
Is there an efficient way for pandas to do this type of operation?
Please help me with this.
Example:
I have a df as:
| ID | File |
| -------- | -------------- |
| 1 | SWE_Toot |
| 2 | SWE_Thun |
| 3 | IDH_Toet |
| 4 | SDF_Then |
| 5 | SWE_Toot |
| 6 | SWE_Thun |
| 7 | SEH_Toot |
| 8 | SFD_Thun |
I will get components in other tables as
| ID | File |
| -------- | -------------- |
| Software | */SWE_Toot/*.h |
| |*/IDH_Toet/*.c |
| |*/SFD_Toto/*.c |
second as:
| ID | File |
| -------- | -------------- |
| Wire | */SDF_Then/*.h |
| |*/SFD_Thun/*.c |
| |*/SFD_Toto/*.c |
etc.; in total there will be around 1,000,000 files and 278 components.
I want as
| ID | File |Component|
| -------- | -------------- |---------|
| 1 | SWE_Toot |Software |
| 2 | SWE_Thun |Other |
| 3 | IDH_Toet |Software |
| 4 | SDF_Then |Wire |
| 5 | SWE_Toto |Various |
| 6 | SWE_Thun |Other |
| 7 | SEH_Toto |Various |
| 8 | SFD_Thun |Wire |
Other - filled in at the end, once all the fields and regexes have been checked and the file does not belong to any component.
Various - the file may belong to more than one component (or we can give a list of the components it belongs to).
I was able to read the component tables and build a regex, but to create the component column I would have to write for loops over all 278 components and loop over the same table for each component.
Is there a way to do this more easily with pandas?
Because the data will be very large.
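One common vectorized approach (a sketch, not from the thread, with made-up component names and patterns) is to loop over the 278 components instead of the 1,000,000 rows: combine each component's file patterns into a single regex and apply the vectorized Series.str.contains once per component.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'File': ['SWE_Toot', 'SWE_Thun', 'IDH_Toet', 'SDF_Then']})

# One combined regex per component, built from that component's file patterns.
# These two are illustrative stand-ins for the 278 real component tables.
component_patterns = {
    'Software': 'SWE_Toot|IDH_Toet|SFD_Toto',
    'Wire': 'SDF_Then|SFD_Thun|SFD_Toto',
}

df['Component'] = 'Other'
for name, regex in component_patterns.items():
    hit = df['File'].str.contains(regex, regex=True)
    # A row already claimed by another component becomes 'Various'.
    df.loc[hit & (df['Component'] != 'Other'), 'Component'] = 'Various'
    df.loc[hit & (df['Component'] == 'Other'), 'Component'] = name

print(df)
```

Each str.contains call scans the whole column in C-speed vectorized code, so the Python-level loop runs 278 times rather than a million.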
I have 2 views as below:
experiments:
select * from experiments;
+--------+--------------------+-----------------+
| exp_id | exp_properties | value |
+--------+--------------------+-----------------+
| 1 | indicator:chemical | phenolphthalein |
| 1 | base | NaOH |
| 1 | acid | HCl |
| 1 | exp_type | titration |
| 1 | indicator:color | faint_pink |
+--------+--------------------+-----------------+
calculations:
select * from calculations;
+--------+------------------------+--------------+
| exp_id | exp_report | value |
+--------+------------------------+--------------+
| 1 | molarity:base | 0.500000000 |
| 1 | volume:acid:in_ML | 23.120000000 |
| 1 | volume:base:in_ML | 5.430000000 |
| 1 | moles:H | 0.012500000 |
| 1 | moles:OH | 0.012500000 |
| 1 | molarity:acid | 0.250000000 |
+--------+------------------------+--------------+
I managed to pivot each of these views individually as below:
experiments_pivot:
+-------+--------------------+------+------+-----------+----------------+
|exp_id | indicator:chemical | base | acid | exp_type | indicator:color|
+-------+--------------------+------+------+-----------+----------------+
| 1 | phenolphthalein | NaOH | HCl | titration | faint_pink |
+-------+--------------------+------+------+-----------+----------------+
calculations_pivot:
+-------+---------------+---------------+--------------+-------------+------------------+-------------------+
|exp_id | molarity:base | molarity:acid | moles:H | moles:OH | volume:acid:in_ML| volume:base:in_ML |
+-------+---------------+---------------+--------------+-------------+------------------+-------------------+
| 1 | 0.500000000 | 0.250000000 | 0.012500000 | 0.012500000 | 23.120000000 | 5.430000000 |
+-------+---------------+---------------+--------------+-------------+------------------+-------------------+
My question is how to get these two pivot results as a single row? Desired result is as below:
+-------+--------------------+------+------+-----------+----------------+--------------+---------------+--------------+-------------+------------------+------------------+
|exp_id | indicator:chemical | base | acid | exp_type | indicator:color|molarity:base | molarity:acid | moles:H | moles:OH | volume:acid:in_ML| volume:base:in_ML |
+-------+--------------------+------+------+-----------+----------------+--------------+---------------+--------------+-------------+------------------+------------------+
| 1 | phenolphthalein | NaOH | HCl | titration | faint_pink | 0.500000000 | 0.250000000 | 0.012500000 | 0.012500000 | 23.120000000 | 5.430000000 |
+-------+--------------------+------+------+-----------+----------------+--------------+---------------+--------------+-------------+------------------+------------------+
Database Used: Mysql
Important Note: Each of these views can have an increasing number of rows; hence I used "dynamic pivoting" for each view individually.
For reference -- Below is a prepared statement I used to pivot experiments in MySQL(and a similar statement to pivot the other view as well):
set @sql = NULL;
SELECT
  GROUP_CONCAT(DISTINCT
    CONCAT(
      'MAX(IF(exp_properties = ''',
      exp_properties,
      ''', value, NULL)) AS ',
      CONCAT('`', exp_properties, '`')
    )
  ) INTO @sql
FROM experiments;
set @sql = CONCAT(
  'SELECT exp_id, ',
  @sql,
  ' FROM experiments GROUP BY exp_id'
);
PREPARE stmt FROM @sql;
EXECUTE stmt;
Is there a more efficient way to parse the dependency version requirement specification referenced here and extract the version in Python? (https://maven.apache.org/pom.html#Dependency_Version_Requirement_Specification)
This is what I've got so far; it doesn't feel like the most efficient way, and I guess it's buggy.
for version in versions:
    pattern = re.findall(r"\(,[.0-9]+|[.0-9]+\)|[.0-9]+|\([.0-9]+", version)
    if pattern:
        for matches in pattern:
            if re.findall(r"[.0-9]+\)", matches):
                # matched something like "2.0)" -- exclusive upper bound
                pattern_version = "<" + matches[:-1]
            elif re.findall(r"\(,[.0-9]+", matches):
                # matched something like "(,1.0" -- unbounded below
                pattern_version = "<=" + matches[2:]
            elif re.findall(r"\([.0-9]+", matches):
                # matched something like "(1.1" -- exclusive lower bound
                pattern_version = ">" + matches[1:]
            else:
                pattern_version = matches
The expected output would be:
(,1.0],[1.2,) parses to: x <= 1.0 or x >= 1.2
(?P<eq>^[\d.]+$)|(?:^\[(?P<heq>[\d.]+)\]$)|(?:(?P<or>(?<=\]|\)),(?=\[|\())|,|(?:(?<=,)(?:(?P<lte>[\d.]+)\]|(?P<lt>[\d.]+)\)))|(?:(?:\[(?P<gte>[\d.]+)|\((?P<gt>[\d.]+))(?=,)))+
The regex will try to match the version in this order:
First try to match "Soft" requirement using:
(?P<eq>^[\d.]+$)
Then try to match "Hard" requirement using:
(?:^\[(?P<heq>[\d.]+)\]$)
Otherwise try to match ranges in this order:
First determine if this is a multiple set using:
(?:(?P<or>(?<=\]|\)),(?=\[|\())
which will only match the comma separating sets.
Then try to match the comma separating ranges within the same set using:
,.
Then proceeds to match the actual ranges:
start matching the upper bound value using
(?:(?<=,)(?:(?P<lte>[\d.]+)\]|(?P<lt>[\d.]+)\)))
then the lower bound value using
(?:(?:\[(?P<gte>[\d.]+)|\((?P<gt>[\d.]+))(?=,))
The result for the versions included in the specification will be:
| version | eq | heq | gte | gt | or | lte | lt |
| ------------- | --- | --- | --- | --- | -- | --- | --- |
| 1.0 | 1.0 | | | | | | |
| [1.0] | | 1.0 | | | | | |
| (,1.0] | | | | | | 1.0 | |
| [1.2,1.3] | | | 1.2 | | | 1.3 | |
| [1.0,2.0) | | | 1.0 | | | | 2.0 |
| [1.5,) | | | 1.5 | | | | |
| (,1.0],[1.2,) | | | 1.2 | | , | 1.0 | |
| (,1.1),(1.1,) | | | | 1.1 | , | | 1.1 |
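As a quick check, the pattern above can be exercised with re.search and groupdict (a sketch; empty named groups are dropped for readability):

```python
import re

# The pattern from the answer, split across lines for readability.
pattern = re.compile(
    r"(?P<eq>^[\d.]+$)|(?:^\[(?P<heq>[\d.]+)\]$)"
    r"|(?:(?P<or>(?<=\]|\)),(?=\[|\())|,"
    r"|(?:(?<=,)(?:(?P<lte>[\d.]+)\]|(?P<lt>[\d.]+)\)))"
    r"|(?:(?:\[(?P<gte>[\d.]+)|\((?P<gt>[\d.]+))(?=,)))+"
)

for version in ["1.0", "[1.0]", "(,1.0]", "[1.2,1.3]", "(,1.0],[1.2,)"]:
    groups = {k: v for k, v in pattern.search(version).groupdict().items() if v}
    print(version, "->", groups)
```

Note that with a repeated group, Python's re keeps only the last value each named group captured, which is exactly what produces one column per bound in the table above.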
I have a decision tree output in a text format which is very hard to read and interpret. There are a ton of pipes and indentation to follow through the tree/nodes/leaves. I was wondering if there are tools out there where I can feed in a decision tree like the one below and get a tree diagram, like Weka, Python, etc. do?
Since my decision tree is very large, below is a sample/partial decision tree to give an idea of my text format. Thanks a bunch!
"bio" <= 0.5:
| "ml" <= 0.5:
| | "algorithm" <= 0.5:
| | | "bioscience" <= 0.5:
| | | | "microbial" <= 0.5:
| | | | | "assembly" <= 0.5:
| | | | | | "nano-tech" <= 0.5:
| | | | | | | "smith" <= 0.5:
| | | | | | | | "neurons" <= 0.5:
| | | | | | | | | "process" <= 1.5:
| | | | | | | | | | "program" <= 1.5:
| | | | | | | | | | | "mammal" <= 1.0:
| | | | | | | | | | | | "lab" <= 0.5:
| | | | | | | | | | | | | "human-machine" <= 1.5:
| | | | | | | | | | | | | | "tech" <= 0.5:
| | | | | | | | | | | | | | | "smith" <= 0.5:
I'm not aware of any tool that interprets that format, so I think you're going to have to write something yourself, either to interpret the text format or to retrieve the tree structure using the DecisionTree class in MALLET's Java API.
Interpreting the text in Python shouldn't be too hard: for example, if
line = '| | | | | "assembly" <= 0.5:'
then you can get the indent level, the predictor name and the split point with
parts = line.split('"')
indent = parts[0].count('| ')                          # nesting depth
predictor = parts[1]                                   # name between the quotes
splitpoint = float(parts[2].rstrip(':').split()[-1])   # number before the colon
To create graphical output, I would use GraphViz. There are Python APIs for it, but it's simple enough to build a file in its text-based dot format and create a graphic from it with the dot command. For example, the file for a simple tree might look like
digraph MyTree {
Node_1 [label="Predictor1"]
Node_1 -> Node_2 [label="< 0.335"]
Node_1 -> Node_3 [label=">= 0.335"]
Node_2 [label="Predictor2"]
Node_2 -> Node_4 [label="< 1.42"]
Node_2 -> Node_5 [label=">= 1.42"]
Node_3 [label="Class1
(p=0.897, n=26)", shape=box,style=filled,color=lightgray]
Node_4 [label="Class2
(p=0.993, n=17)", shape=box,style=filled,color=lightgray]
Node_5 [label="Class3
(p=0.762, n=33)", shape=box,style=filled,color=lightgray]
}
and running it through dot (e.g. dot -Tpng tree.dot -o tree.png) produces the corresponding tree diagram (image not reproduced here).
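Putting the two pieces together, here is a minimal sketch that converts split lines in the text format into a dot file. It assumes every input line is a split of the form `| | "name" <= x:`; leaf lines and the implicit `>` branches of a real MALLET tree would need extra handling:

```python
def tree_to_dot(lines):
    """Convert pipe-indented split lines into GraphViz dot (sketch)."""
    out = ["digraph MyTree {"]
    last_at_depth = {}            # depth -> id of the last node seen at that depth
    for i, line in enumerate(lines):
        parts = line.split('"')
        depth = parts[0].count('| ')
        predictor = parts[1]
        splitpoint = float(parts[2].rstrip(':').split()[-1])
        node = "Node_%d" % i
        out.append('%s [label="%s"]' % (node, predictor))
        if depth > 0 and depth - 1 in last_at_depth:
            parent = last_at_depth[depth - 1]
            out.append('%s -> %s [label="<= %s"]' % (parent, node, splitpoint))
        last_at_depth[depth] = node
    out.append("}")
    return "\n".join(out)

print(tree_to_dot(['"bio" <= 0.5:', '| "ml" <= 0.5:']))
```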
UPDATE(04/20/17):
I am using Apache Spark 2.1.0 and I will be using Python.
I have narrowed down the problem and hopefully someone more knowledgeable with Spark can answer. I need to create an RDD of tuples from the header of the values.csv file:
values.csv (main collected data, very large):
+--------+---+---+---+---+---+----+
| ID | 1 | 2 | 3 | 4 | 9 | 11 |
+--------+---+---+---+---+---+----+
| | | | | | | |
| abc123 | 1 | 2 | 3 | 1 | 0 | 1 |
| | | | | | | |
| aewe23 | 4 | 5 | 6 | 1 | 0 | 2 |
| | | | | | | |
| ad2123 | 7 | 8 | 9 | 1 | 0 | 3 |
+--------+---+---+---+---+---+----+
output (RDD):
+----------+----------+----------+----------+----------+----------+----------+
| abc123 | (1;1) | (2;2) | (3;3) | (4;1) | (9;0) | (11;1) |
| | | | | | | |
| aewe23 | (1;4) | (2;5) | (3;6) | (4;1) | (9;0) | (11;2) |
| | | | | | | |
| ad2123 | (1;7) | (2;8) | (3;9) | (4;1) | (9;0) | (11;3) |
+----------+----------+----------+----------+----------+----------+----------+
What happened was I paired each value with the column name of that value in the format:
(column_number, value)
raw format (if you are interested in working with it):
id,1,2,3,4,9,11
abc123,1,2,3,1,0,1
aewe23,4,5,6,1,0,2
ad2123,7,8,9,1,0,3
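The pairing described above can be sketched in plain Python; in Spark, the same zip would run inside a map over the RDD, with the parsed header broadcast to the workers (the header here is hard-coded for illustration):

```python
header = ["1", "2", "3", "4", "9", "11"]   # parsed once from the csv's first line

def pair_row(line):
    """Pair each value with its column name, keeping the row id."""
    row_id, *vals = line.split(",")
    return row_id, [(col, val) for col, val in zip(header, vals)]

print(pair_row("abc123,1,2,3,1,0,1"))
# → ('abc123', [('1', '1'), ('2', '2'), ('3', '3'), ('4', '1'), ('9', '0'), ('11', '1')])
```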
The Problem:
The example values.csv file contains only a few columns, but in the actual file there are thousands of columns. I can extract the header and broadcast it to every node in the distributed environment, but I am not sure if that is the most efficient way to solve the problem. Is it possible to achieve the output with a parallelized header?
I think you can achieve this using a PySpark DataFrame too. However, my solution is not optimal yet. I use split to get the new column names and the corresponding columns to sum. This depends on how large your key_list is; if it's too large, this might not work well, because you have to load key_list into memory (using collect).
import pandas as pd
import pyspark.sql.functions as func
# example data
values = spark.createDataFrame(pd.DataFrame([['abc123', 1, 2, 3, 1, 0, 1],
['aewe23', 4, 5, 6, 1, 0, 2],
['ad2123', 7, 8, 9, 1, 0, 3]],
columns=['id', '1', '2', '3','4','9','11']))
key_list = spark.createDataFrame(pd.DataFrame([['a', '1'],
['b','2;4'],
['c','3;9;11']],
columns=['key','cols']))
# use values = spark.read.csv(path_to_csv, header=True) for your data
key_list_df = key_list.select('key', func.split('cols', ';').alias('col'))
key_list_rdd = key_list_df.rdd.collect()
for row in key_list_rdd:
values = values.withColumn(row.key, sum(values[c] for c in row.col if c in values.columns))
keys = [row.key for row in key_list_rdd]
output_df = values.select(keys)
Output
output_df.show(n=3)
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 3| 4|
| 4| 6| 8|
| 7| 9| 12|
+---+---+---+