Is there a python package that can find the most impactful group (categorical features) from my data? - python

My problem is that I have a dataset of our campaign like this:
| Customer | Province | District | City | Age | No. of Order |
| -------- | ------- | -------- | -----| ----| ------- |
| A | P1 | D1 | C1 | 21 | 5 |
| B | P2 | D2 | C2 | 22 | 9 |
| ... | ... | ... | ... | ... | ... |
And I need to find the most impactful group of customers (there will usually be more than 20 categorical groups). For example: "Customers from Province P1, District D1, Age 25 are the most promising group because they contributed 50% of total orders while being only 10% of our customer base."
I'm currently using Pandas to loop through all combinations of 2, 3, and 4 of my categorical features and calculate the sales proportion for each group, but it is very time-consuming.
I want to ask if there is already a Python package that can help to find that kind of group?

You can automate that by using decision trees.
Not all features may be useful; eliminate trivial ones using PCA (principal component analysis).
You can use the scikit-learn package for both of the above.
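A minimal sketch of the decision-tree idea, assuming the table above is loaded into a DataFrame named df with the column names from the question; the tree depth and leaf size below are arbitrary illustrative values, not a recommendation:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# One-hot encode the categorical columns, keep Age numeric.
X = pd.get_dummies(df[["Province", "District", "City"]]).join(df[["Age"]])
y = df["No. of Order"]

# A shallow tree keeps the discovered segments small and readable.
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=50).fit(X, y)

# Leaves with a high predicted mean are segments that over-contribute to orders.
print(export_text(tree, feature_names=list(X.columns)))
```

Each path from the root to a high-value leaf reads as a rule such as "Province_P1 = 1 and District_D1 = 1", which is the kind of group description you are after.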

Related

Is there an algorithm for choosing which unknown points to input based on maximising a function of a row/column?

Is there an efficient algorithm for a small sparse table, which will fill up with values, that selects the unknown data point with the greatest POTENTIAL maximum scalar value, taken as a function of its row/column, for instance at the 95% confidence interval?
It would probably involve taking in each data point currently in the table and selecting the unknown data point which has the potential to maximise the result for the data. Once the result is entered, the confidence interval will drop because (1) the sample size increases and (2) the result will most likely be somewhere around the other values and not at the maximum. Either way, the same result cannot be entered.
Best explained with an example:
(the result to be maximised is average)
|        | apples  | oranges |
| ------ | ------- | ------- |
| alice  | £4.50   | unknown |
| bob    | unknown | £4.00   |
| collin | unknown | unknown |

|        | apples  | oranges |
| ------ | ------- | ------- |
| alice  | £4.50   | unknown |
| bob    | unknown | £4.00   |
| collin | £5.00   | unknown |

|        | apples  | oranges |
| ------ | ------- | ------- |
| alice  | £4.50   | unknown |
| bob    | unknown | £4.00   |
| collin | £5.00   | £6.00   |

|        | apples  | oranges |
| ------ | ------- | ------- |
| alice  | £4.50   | unknown |
| bob    | £3.00   | £4.00   |
| collin | £5.00   | £6.00   |

|        | apples  | oranges |
| ------ | ------- | ------- |
| alice  | £4.50   | £9.00   |
| bob    | £3.00   | £4.00   |
| collin | £5.00   | £6.00   |
Note that it also models the confidence interval of each row as well as each column. The column names and row names are the same in my data, but (x, y) != (y, x) since the relationship is directional, so x^y != y^x, if that helps.
Thank you so much
pseudo code:
# Upper confidence bound for each column and each row.
column_conf = [conf_high(column) for column in columns]
row_conf = [conf_high(row) for row in rows]

# Pick the unknown cell whose combined row/column bound is largest.
best_score, best_cell = float("-inf"), None
for column_idx in range(len(columns)):
    for row_idx in range(len(rows)):
        score = combine_conf(column_conf[column_idx], row_conf[row_idx])
        if score > best_score:
            best_score, best_cell = score, (column_idx, row_idx)
x, y = best_cell
edit:
The x and y attributes are words, but converted into vectors they are 60-dimensional and continuous between 0 and 1. Assume I've already got a method for determining which new word to choose based on a filled-out grid of values. Now I want to trade off the accuracy of the word generator against the n^2-n complexity of adding a new word, in order to arrive at an accurate solution quicker.

Python, extracting features from time series (TSFRESH package, or what can I use?)

I need some help for feature extraction in time series, maybe using the TSFRESH package.
I have circa 5000 CSV files, and each one of them is a single time series (they may differ in length). The CSV time series is pretty straightforward:
Example of a CSV-Time-Series file:
| Date | Value |
| ------ | ----- |
| 1/1/1904 01:00:00,000000 | 1,464844E-3 |
| 1/1/1904 01:00:01,000000 | 1,953125E-3 |
| 1/1/1904 01:00:02,000000 | 4,882813E-4 |
| 1/1/1904 01:00:03,000000 | -2,441406E-3 |
| 1/1/1904 01:00:04,000000 | -9,765625E-4 |
| ... | ... |
Along with these CSV files, I also have a metadata file (in a CSV format), where each row refers to one of those 5000 CSV-time-series, and reports more general information about that time series such as the energy, etc.
Example of the metadata-CSV file:
| Path of the CSV-timeseries | Label | Energy | Penetration | Porosity |
| ------ | ----- | ------ | ----- | ----- |
| ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
The most important column is the "Label" one since it reports if a CSV-time-series was labeled as:
Good
Bad
I should also consider the energy, penetration, and porosity columns, since those values play a big role in the labeling of the time series. (I already tried a decision tree looking only at those features; now I would like to analyze the time series themselves to extract knowledge.)
I intend to extract features from the time series so that I can understand which features make a time series be labeled as "Good" or "Bad".
How can I do this with TSFRESH?
Are there other ways to do this?
Could you show me how to do it? Thank you :)
I'm currently doing something similar, and this example Jupyter notebook from GitHub helped me.
In short, the basic process is:
Bring the time series into an acceptable format; see the tsfresh documentation for more information.
Extract features from the time series using X = extract_features(...).
Select relevant features using X_filtered = select_features(X, y), with y being your label, good or bad being e.g. 1 and 0.
Put the selected features into a classifier, as also shown in the Jupyter notebook.
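A minimal sketch of those steps, assuming the metadata file is called metadata.csv, that the files behind its path column have no header row, and that the Label column holds the strings "Good"/"Bad"; any name below that is not in the question is an assumption, and parsing details (separators, decimal commas) are glossed over:

```python
import pandas as pd
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute

# Concatenate the ~5000 CSVs into one long-format frame: (file_id, Date, Value).
meta = pd.read_csv("metadata.csv")
frames = []
for i, row in meta.iterrows():
    ts = pd.read_csv(row["Path of the CSV-timeseries"], names=["Date", "Value"])
    ts["file_id"] = i
    frames.append(ts)
timeseries = pd.concat(frames, ignore_index=True)

# 1 = "Good", 0 = "Bad".
labels = (meta["Label"] == "Good").astype(int)

# One row of extracted features per time series.
X = extract_features(timeseries, column_id="file_id", column_sort="Date")
impute(X)  # replace NaN/inf produced by some feature calculators

# Keep only the features statistically relevant to the Good/Bad label.
labels = labels.loc[X.index]  # align the label index with the feature rows
X_filtered = select_features(X, labels)
```

X_filtered can then go into any classifier, and inspecting which columns survived select_features already tells you which time-series characteristics separate "Good" from "Bad".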

Efficient way to write Pandas groupby codes by eliminating repetition

I have a DataFrame as below.
import pandas as pd

df = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'City': ['C 1', 'C 1', 'C 1', 'B 2', 'B 2', 'B 2', 'C 1', 'C 1', 'C 1'],
    'Date': ['7/1/2020', '7/2/2020', '7/3/2020', '7/1/2020', '7/2/2020', '7/3/2020', '7/1/2020', '7/2/2020', '7/3/2020'],
    'Value': [46, 90, 23, 84, 89, 98, 31, 84, 41]
})
I need to create 2 averages:
Firstly, with both Country and City as the criteria
Secondly, an average for only the Country
To achieve this, we can easily write the code below:
df.groupby(['Country','City']).agg('mean')
+---------+------+-------+
| Country | City | Value |
+---------+------+-------+
| A | B 2 | 90.33 |
| +------+-------+
| | C 1 | 53 |
+---------+------+-------+
| B | C 1 | 52 |
+---------+------+-------+
df.groupby(['Country']).agg('mean')
+---------+-------+
| Country | Value |
+---------+-------+
| A | 71.67 |
+---------+-------+
| B | 52 |
+---------+-------+
The only change between the above 2 snippets is the groupby criterion City; apart from that, everything is the same, so there's a clear repetition/duplication of code (especially when it comes to complex scenarios).
Now my question is: is there any way we could write one piece of code that incorporates both scenarios at once? DRY - Don't Repeat Yourself.
What I have in mind is something like below.
Choice = 'City' `<<-- Here I type either City or None based on the requirement. E.g. if None, the code below will ignore that criterion.`
df.groupby(['Country', Choice]).agg('mean')
Is this possible? Or what is the best way to write the above code efficiently without repetition?
I am not sure what you want to accomplish, but why not just use an if?
columns = ['Country']
if Choice:                     # Choice is either 'City' or None
    columns.append(Choice)
df.groupby(columns).agg('mean')
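If this comes up a lot, the same idea can be wrapped in a small helper; the function name grouped_mean and its extra argument are my own illustration, not part of the answer above:

```python
def grouped_mean(df, extra=None):
    # Always group by Country; add a second key only when one is supplied.
    keys = ['Country'] + ([extra] if extra else [])
    return df.groupby(keys)['Value'].mean()

country_city_avg = grouped_mean(df, 'City')  # Country + City averages
country_avg = grouped_mean(df)               # Country-only averages
```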

Link lists that share common elements

I have an issue similar to this one, with a few differences/complications.
I have a list of groups containing members. Rather than merging the groups that share members, I need to preserve the groupings and create a new set of edges based on which groups have members in common, and do so conditionally based on attributes of the groups.
The source data looks like this:
+----------+------------+-----------+
| Group ID | Group Type | Member ID |
+----------+------------+-----------+
| A | Type 1 | 1 |
| A | Type 1 | 2 |
| B | Type 1 | 2 |
| B | Type 1 | 3 |
| C | Type 1 | 3 |
| C | Type 1 | 4 |
| D | Type 2 | 4 |
| D | Type 2 | 5 |
+----------+------------+-----------+
Desired output is this:
+----------+-----------------+
| Group ID | Linked Group ID |
+----------+-----------------+
| A | B |
| B | C |
+----------+-----------------+
A is linked to B because they share member 2
B is linked to C because they share member 3
C is not linked to D; they have a member in common but are of different types
The number of shared members doesn't matter for my purposes; a single member in common means they're linked.
The output is being used as the edges of a graph, so if the output is a graph that fits the rules, that's fine.
The source dataset is large (hundreds of millions of rows), so performance is a consideration.
This poses a similar question; however, I'm new to Python and can't figure out how to get the source data to a point where I can use that answer, or how to work in the additional requirement that the group types match.
Try something like this:
df1 = df.groupby(['Group Type', 'Member ID'])['Group ID'].apply(','.join).reset_index()
df2 = df1[df1['Group ID'].str.contains(',')]
This might not handle the case of cyclic grouping.
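To go from those comma-joined strings to the explicit (Group ID, Linked Group ID) pairs in the desired output, here is a hedged sketch along the same lines; the column names follow the question, while the pairing logic is my own addition and untested at the hundreds-of-millions-of-rows scale:

```python
from itertools import combinations
import pandas as pd

edges = set()
# Groups of the same type that share a member are linked.
for groups in df.groupby(['Group Type', 'Member ID'])['Group ID'].unique():
    for a, b in combinations(sorted(set(groups)), 2):
        edges.add((a, b))

edge_df = pd.DataFrame(sorted(edges), columns=['Group ID', 'Linked Group ID'])
```

Deduplicating (Group ID, Group Type, Member ID) rows before the groupby may help at that scale, since repeated memberships add nothing to the edge set.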

Simple moving average for random related time values

I'm a beginner programmer looking for help with the Simple Moving Average (SMA). I'm working with column files, where the first column is related to the time and the second is a value. The time intervals are random and so are the values. Usually the files are not big, but the process collects data for a long time. At the end, files look similar to this:
+-----------+-------+
| Time | Value |
+-----------+-------+
| 10 | 3 |
| 1345 | 50 |
| 1390 | 4 |
| 2902 | 10 |
| 34057 | 13 |
| (...) | |
| 898975456 | 10 |
+-----------+-------+
After the whole process the number of rows is around 60k-100k.
Then I'm trying to "smooth" the data with some time window. For this purpose I'm using an SMA. [AWK_method]
awk -v size="$timewindow" '{mod=NR%size; if(NR<=size){count++}else{sum-=array[mod]}; sum+=$1; array[mod]=$1; print sum/count}' file.dat
To achieve proper working of the SMA with a predefined $timewindow, I create a linear increment filled with zeros. Next, I run the script using different $timewindow values and observe the results.
+-----------+-------+
| Time | Value |
+-----------+-------+
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| (...) | |
| 10 | 3 |
| 11 | 0 |
| 12 | 0 |
| (...) | |
| 1343 | 0 |
| (...) | |
| 898975456 | 10 |
+-----------+-------+
For small data it was relatively comfortable, but now it is quite time-devouring, and the created files are starting to be too big. I'm also familiar with Gnuplot, but doing an SMA there is hell...
So here are my questions:
Is it possible to change the awk solution to bypass filling the data with zeros?
Do you recommend any other solution using bash?
I have also considered learning Python, because after 6 months of learning bash I have got to know its limitations. Will I be able to solve this in Python without creating big data files?
I'll be glad for any form of help or advice.
Best regards!
[AWK_method] http://www.commandlinefu.com/commands/view/2319/awk-perform-a-rolling-average-on-a-column-of-data
You included a python tag, check out traces:
http://traces.readthedocs.io/en/latest/
Here are some other insights:
Moving average for time series with unequal intervals
http://www.eckner.com/research.html
https://stats.stackexchange.com/questions/28528/moving-average-of-irregular-time-series-data-using-r
https://en.wikipedia.org/wiki/Unevenly_spaced_time_series
Key phrase for more research:
In statistics, signal processing, and econometrics, an unevenly (or unequally or irregularly) spaced time series is a sequence of observation time and value pairs (t_n, X_n) with strictly increasing observation times. As opposed to equally spaced time series, the spacing of observation times is not constant.
awk '{Q=$2-last;if(Q>0){while(Q>1){print "| "++i" | 0 |";Q--};print;last=$2;next};last=$2;print}' Input_file
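Since you mention possibly moving to Python: pandas can compute a time-windowed rolling mean directly on irregular timestamps, which avoids the zero-filling step entirely. A minimal sketch, assuming a whitespace-separated two-column file and treating the integer times as seconds; the 1000-second window is only an example:

```python
import pandas as pd

# Read the irregular (Time, Value) samples; adjust sep/names to your actual format.
df = pd.read_csv("file.dat", sep=r"\s+", names=["Time", "Value"])

# A time-based rolling window needs a datetime- or timedelta-like index.
df.index = pd.to_timedelta(df["Time"], unit="s")

# Mean of all samples whose time falls within the preceding 1000 seconds.
df["SMA"] = df["Value"].rolling("1000s").mean()
```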
