Recommendation on selection of required fields using f expressions in pydatatable dataframe - python

I have created a datatable Frame as follows:
from datatable import dt, f
DT_EX = dt.Frame({'sales': [103.07, 47.28, 162.15, 84.47, 44.97, 46.97, 34.99, 9.99, 29.99, 64.98],
                  'quantity': [6, 2, 8, 3, 3, 3, 1, 1, 1, 2],
                  'customer_lifecycle_status': ['Lead', 'First time buyer', 'Active customer', 'Defecting customer',
                                                'Lead', 'First time buyer', 'Lead', 'Lead', 'Lead', 'Lead']})
Now I'm trying to select only two fields from the frame:
DT_EX[:, f.sales, f.quantity]
In this case, the output shows the columns as (quantity, sales), whereas I expected them in the specified order (sales, quantity). I also noticed that the rows come out sorted by quantity in ascending order.
Setting this case aside, I then tried passing the required fields in parentheses:
DT_EX[:, (f.sales,f.quantity)]
This produces the correct output, with no sorting or reordering of the fields.
So it seems advisable to always pass the fields to be selected in parentheses.
Finally, I would like to know what happened in the first case. Could you please explain it clearly?

The primary syntax of datatable is
DT[i, j, by, ...]
That is, when you write a sequence of expressions in the square brackets, the first one is interpreted as i (row filter), the second as j (column selector), the third as by (group-by variable).
Normally, you would use a by() function to express a group-by condition, but the old syntax allowed a bare column name in the third position of DT[], where it was interpreted as a group-by variable. Such use is considered deprecated nowadays and may eventually be removed, but for now it still works this way.
Thus, when you write DT_EX[:, f.sales, f.quantity], the quantity column is interpreted as a group by condition (and since j does not have any reduction operations, it works essentially as a sort). Another effect of using a grouping variable is that it is moved to the front of the resulting frame, which essentially means you'll see columns (quantity, sales) in the "opposite" order of how they were listed.
If all you need is to select 2 columns from a frame, however, then you need to make sure those 2 columns are both within the j position in the list of arguments to DT[...]. This can be done via a list, or a tuple, or a dictionary:
DT_EX[:, [f.sales, f.quantity]]
DT_EX[:, (f.sales, f.quantity)]
DT_EX[:, {"SALES": f.sales, "QUANT": f.quantity}]

Related

Different bracket types python

So I've been following a guide I got off reddit about understanding python's bracket requirements:
Is it a list? Then use brackets.
Is it a dict? Then use braces.
Otherwise, you probably want parentheses.
However, I've come across something that the above can't explain:
df.groupby('Age')['Salary'].mean()
In this case, both Age & Salary are lists (they are both columns from a df), so why do we use parentheses for Age and brackets for Salary?
Additionally, why is there a dot before mean, but not in-between ('Age') and ['Salary']?
I realise the questions I'm asking may be fairly basic. I'm working my way through Python Essential Reference (4th ed) Developer's Library. If anyone has any sources dealing with my kind of questions it would be great to see them.
Thanks
If you'll forgive me for answering the important question rather than the one you asked...
That's a very compact chain. Break it into separate lines and then use the debugging view of an IDE to step through it to understand the data types involved.
query_method = df.groupby
query_string = 'Age'
query_return = query_method(query_string)
data = query_return['Salary']
data_mean = data.mean()
Step through in the PyCharm Debugger and you can see type for every variable.
There is a lot of context here that can be found in the pandas dataframe documentation.
To start off, df is an object of class pandas.DataFrame.
pandas.DataFrame has a function called groupby that takes some input. In your example, the input is 'Age'. When you pass arguments to a function it looks like this:
my_function(input)
when you have more than one input, the common way to pass them is as multiple variables, like this
my_function(input1, input2, etc, ...)
pandas.DataFrame.groupby(...) returns an object that is subscriptable. Subscript notation is what you use to access an element in a list or a dict, like this:
my_list = [1,2,3]
print(my_list[0]) # --> 1
my_dict = {
"a": "apple",
"b": "banana",
"c": "cucumber"
}
print(my_dict["b"]) # --> banana
coming back to your specific question:
df.groupby('Age')['Salary'].mean()
df # df, the name of your DataFrame variable
.groupby('Age') # call the function groupby to get the frame grouped by the column 'Age'
['Salary'] # access the 'Salary' element from that groupby
.mean() # and apply the mean() function to the 'Salary' element
So it appears that you are getting the mean salary for each age of employee.
I hope this helps to explain
both Age & Salary are lists (they are both columns from a df),
They're columns (pandas Series), not lists, and in this expression 'Age' and 'Salary' are just strings that name those columns. The groupby method of a DataFrame returns an indexed object. Calling a method requires parentheses, like print(). You can use square brackets to access indexed data (compare dict objects).
The period followed by a name and parentheses is another method call.
why is there a dot before mean, but not in-between ('Age') and ['Salary']
The short answer is that foo.['bar'] is not valid syntax.
But df.groupby("Age").some_func() certainly could be done, depending on the methods available on that object.
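To make the three notations concrete, here is a minimal pure-Python sketch. The classes Grouped and Column are hypothetical stand-ins invented for illustration (they are not pandas classes); they only show that parentheses call something, square brackets go through __getitem__, and a dot looks up an attribute or method.

```python
class Grouped:
    """Hypothetical stand-in for the object returned by df.groupby('Age')."""

    def __init__(self, columns):
        self._columns = columns  # dict: column name -> list of numbers

    def __getitem__(self, name):
        # Square brackets: grouped[name] calls grouped.__getitem__(name).
        return Column(self._columns[name])


class Column:
    """Hypothetical stand-in for a single selected column."""

    def __init__(self, values):
        self.values = values

    def mean(self):
        # Dot plus parentheses: look up the attribute 'mean', then call it.
        return sum(self.values) / len(self.values)


grouped = Grouped({"Salary": [100, 200, 300]})
column = grouped["Salary"]  # subscription, no dot involved
result = column.mean()      # attribute lookup followed by a call
print(type(column).__name__, result)
```

The chained form grouped["Salary"].mean() does the same thing in one expression, which mirrors the shape of df.groupby('Age')['Salary'].mean().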

Define variable number of columns in for loop

I am new to pandas and I am creating new columns based on conditions from other existing columns using the following code:
df.loc[(df.item1_existing=='NO') & (df.item1_sold=='YES'),'unit_item1']=1
df.loc[(df.item2_existing=='NO') & (df.item2_sold=='YES'),'unit_item2']=1
df.loc[(df.item3_existing=='NO') & (df.item3_sold=='YES'),'unit_item3']=1
Basically, what this means is that if item is NOT existing ('NO') and the item IS sold ('YES') then give me a 1. This works to create 3 new columns but I am thinking there is a better way. As you can see, there is a repeated string in the name of the columns: '_existing' and '_sold'. I am trying to create a for loop that will look for the name of the column that ends with that specific word and concatenate the beginning, something like this:
unit_cols = ['item1','item2','item3']
for i in unit_cols:
    df.loc[('df.'+i+'_existing'=='NO') & ('df'+i+'_sold'=='YES'),'unit_'+i]=1
but of course, it doesn't work. As I said, I am able to make it work with the initial example, but I would like to have fewer lines of code instead of repeating the same code because I need to create several columns this way, not just three. Is there a way to make this easier? is the for loop the best option? Thank you.
You can use Boolean series, i.e. True / False depending on whether your condition is met. Combined with pd.Series.eq and f-strings (PEP 498, Python 3.6+), and using __getitem__ (or its syntactic sugar []) to allow string inputs, you can write your logic more readably:
unit_cols = ['item1','item2','item3']
for i in unit_cols:
    df[f'unit_{i}'] = df[f'{i}_existing'].eq('NO') & df[f'{i}_sold'].eq('YES')
If you need integers (1 / 0) instead of Boolean values, you can convert via astype:
df[f'unit_{i}'] = df[f'unit_{i}'].astype(int)
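As a rough end-to-end sketch of the loop above, the following uses a small made-up frame (the sample values are assumptions; only the column naming pattern comes from the question) and does the integer conversion inside the loop.

```python
import pandas as pd

# Hypothetical sample data; only the column naming pattern is from the question.
df = pd.DataFrame({
    "item1_existing": ["NO", "YES", "NO"],
    "item1_sold":     ["YES", "YES", "NO"],
    "item2_existing": ["NO", "NO", "YES"],
    "item2_sold":     ["YES", "NO", "YES"],
})

unit_cols = ["item1", "item2"]
for i in unit_cols:
    # True where the item did not exist before and was sold.
    is_new_sale = df[f"{i}_existing"].eq("NO") & df[f"{i}_sold"].eq("YES")
    # Store 1 / 0 instead of True / False.
    df[f"unit_{i}"] = is_new_sale.astype(int)

print(df[["unit_item1", "unit_item2"]])
```

Note one difference from the original df.loc version: rows that do not meet the condition get 0 here, whereas the df.loc assignment leaves them as NaN.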

Checking dataframe cells to see if they contain a value

Let's say I have a fairly simple code such as
import pandas
df_import=pandas.read_excel("dataframe.xlsx")
df_import['Company'].str.contains('value',na=False,case=False)
So this imports pandas, creates a DataFrame from an Excel document, searches the column titled Company for some value, and returns a Boolean Series saying whether each cell contains that value (True or False).
However, I want to test three cases: case 1, no results were found (all False); case 2, exactly one result was found (only one True); and case 3, more than one result was found (number of True > 1).
My thought is that I could set up a for loop, iterating through the column, and if the value for a cell is True, add 1 to a variable (let's call it count). Then at the end, I would have an if/elif/else statement based on the value of count, whether it is 0, 1, or >1.
Now, maybe there is a better way to check this but if not, I figured the for loop would look something like
for i in range(len(df_import.index)):
    if df_import.iloc[i, 0].str.contains('value', na=False, case=False):
        count += 1
First of all, I'm not sure if I should use .iloc or .iat but both give me the error
AttributeError: 'str' object has no attribute 'str'
and I wasn't able to find a correction for this.
Your current code is not going to work because iloc[i, 0] returns a scalar value, and of course, those don't have str accessor methods associated with them.
A quick and easy fix would be to just call sum on the series level str.contains call.
count = df_import['Company'].str.contains('value', na=False, case=False).sum()
Now, count contains the number of matches in that column.
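Building on that, the three cases from the question could be handled roughly as follows. This is only a sketch: the small frame below is made-up data standing in for the Excel file, and the result strings are placeholders.

```python
import pandas as pd

# Hypothetical data standing in for the Excel sheet in the question.
df_import = pd.DataFrame({"Company": ["Value Corp", "Acme", None, "best VALUE ltd"]})

# Boolean Series: True where the cell contains 'value' (case-insensitive).
matches = df_import["Company"].str.contains("value", na=False, case=False)
count = int(matches.sum())  # True counts as 1, False as 0

if count == 0:
    result = "no match"
elif count == 1:
    result = "exactly one match"
else:
    result = "more than one match"

print(count, result)
```

If only the "any match at all" question matters, matches.any() avoids counting altogether.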

creating a mapping variable in python

This is a little extension to my previous problem. I have a variable known as region with values such as:
/health/blood pressure
/health/diabetes
/cmd/diagnosis
/fitness
/health
/health/type1 diabetes
Now I want to add one more variable known as region_map alongside with region variable which maps each region to a unique name such as Region1, Region2, so on as they appear sequentially in the column. So the output would look like:
/health/blood pressure Region1
/health/diabetes Region2
/cmd/diagnosis Region3
/fitness Region4
/health Region5
/health/type1 diabetes Region6
Also all the region values are unique.
Basically, this is web-related data, but for now it is a standalone task. In the Python shell I have imported the data, which consists of just a unique list of regions. The first task was to remove the integer values, which I did successfully with the help of other members. I now have 2000+ region entries; the sample above shows what they look like. For subsequent analysis, such as merging with other files, I want to add a mapping variable alongside the existing region column, so that each of the 2000+ unique entries gets its own label: Region1, Region2, Region3, and so on. I have created a dictionary with the following code:
mapping={region:index for index, region in enumerate(data.region)}
But the problem is how to loop through it and replace the existing values with Region1, Region2, and so on. As I explained, I want to add one more column in which each entry maps to a different region label. I hope this explains it well!
You can use a dictionary, with the regions as keys and the names as values.
You assign an entry like this:
dictionary[key] = value
This creates a new key and assigns a value to it.
For example, the key could be "/health/blood pressure" and the value "Region 1".
When you look it up:
dictionary["/health/blood pressure"]
it returns "Region 1".
To loop through the dictionary, do the following:
for i in dictionary:
    # condition to see if you need to change the current value
    dictionary[i] = "New Value"  # i is the current key in the dictionary
For more information, see the Python documentation, since this is just basic information on using dictionaries. If you are building a web service, I would personally prefer a database for this, but I don't know the bigger picture, so it's hard to say.
I assume that you already have a list of the regions and need a new structure that maps each region to its region label.
region = ["/health/blood pressure", "/health/diabetes", "/cmd/diagnosis" ,"/fitness", "/health", "/health/type1 diabetes"]
region_mapped = {}
region_count = len(region)
for region_index in range(0, region_count):
    # Number from 1 so the labels read Region1, Region2, ... as in the question
    region_mapped[region[region_index]] = "Region%s" % (region_index + 1)
print(region_mapped)
Unlike a list, a dict has no append/extend methods; you add entries by assigning to a key.
Since you need key:value pairs, the example above is well suited. You could tidy it up further by wrapping it in a function.
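The same mapping can be written more compactly with enumerate. This is a sketch under the question's own assumptions (unique regions, labels numbered from 1 in order of appearance); the variable names are illustrative.

```python
regions = [
    "/health/blood pressure",
    "/health/diabetes",
    "/cmd/diagnosis",
    "/fitness",
    "/health",
    "/health/type1 diabetes",
]

# enumerate(..., start=1) yields (1, first region), (2, second region), ...
region_map = {region: f"Region{n}" for n, region in enumerate(regions, start=1)}

# Pair each region with its label, keeping the original order.
rows = [(region, region_map[region]) for region in regions]
for region, label in rows:
    print(region, label)
```

If the regions live in a pandas column, the same dictionary can be applied with data["region"].map(region_map) to produce the region_map column.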

django tutorial part 1 (p.choice_set.all() returns in reverse order)

https://docs.djangoproject.com/en/1.5/intro/tutorial01/
I am following the Django tutorial and near the end of part 1 where p.choice_set.all() is called I get the display in reverse order.
For example I get: Just hacking again, The Sky, Not much
Instead of: Not much, The sky, Just hacking again
Any idea why I get reverse order? I typed it in the same order as the tutorial indicated.
I am running with PostgreSQL.
If you don't explicitly do an order_by() there's no guarantee which order the results will be returned in.
Not sure about PostgreSQL, but for MySQL, in practice, if you're only selecting a subset of rows based on some indexed column, it'll order by that index, but if you're selecting every row, it'll use a table scan and return the results in the order in which they're stored on disk.
With regards to disk order, if you've ever deleted a row, then it'll try to fill in the space used by that row when you next insert a row, so if you had...
ID Name
1 Bob
2 Fred
3 Jim
...then delete the first row, then add in a new one, you get...
ID Name
4 Jeff
2 Fred
3 Jim
...at which point a SELECT * FROM my_table will return them in the order 4, 2, 3, whereas a SELECT * FROM my_table WHERE id > 1 will (probably) return them in the order 2, 3, 4.
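The point about explicit ordering can be tried out without a Django project. The sketch below uses Python's built-in sqlite3 module rather than PostgreSQL or the Django ORM, and the table and column names are made up for illustration; the takeaway is only that the order is guaranteed when the query asks for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE choice (id INTEGER PRIMARY KEY, choice_text TEXT)")
# Insert the rows out of id order on purpose.
conn.executemany(
    "INSERT INTO choice (id, choice_text) VALUES (?, ?)",
    [(3, "Just hacking again"), (1, "Not much"), (2, "The sky")],
)

# Without ORDER BY, the database is free to return rows in any order.
unordered = [row[0] for row in conn.execute("SELECT choice_text FROM choice")]

# With ORDER BY, the order is part of what the query promises.
ordered = [row[0] for row in conn.execute("SELECT choice_text FROM choice ORDER BY id")]
print(ordered)
conn.close()
```

In the tutorial's terms, the equivalent would be p.choice_set.order_by('id') instead of p.choice_set.all().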
Simple explanation:
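You can order the results with order_by(). If you don't use it, and the model defines no default ordering, the order is not guaranteed; it often follows id, but it does not have to. That is why you are getting the reverse order.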
You can check it by simple for loop:
for item in p.choice_set.all():
    print(item.id)
Thanks
