How to find duplicated values in Tableau - python

I need to create a new column that indicates whether a customer is new or recurrent.
To do so, I want to check, for each unique value in Phone, whether there is one or more than one date associated with it in the Date column.
  Phone  Date
0     a     1
1     a     1
2     a     2
3     b     2
4     b     2
5     c     3
6     c     2
7     c     1
New users are those for whom there is only one unique (Phone, Date) pair for a given phone. The result I want looks like this:
  Phone  Date  User_type
0     a     1  recurrent
1     a     1  recurrent
2     a     2  recurrent
3     b     2        new
4     b     2        new
5     c     3  recurrent
6     c     2  recurrent
7     c     1  recurrent
I managed to do it in a few lines of Python, but my boss insists that I do it in Tableau.
I know I need to use a calculated field but that's it.
In case it helps, here is my Python code that does the same thing:
import numpy as np
import pandas as pd

for item in set(data.Phone):
    # how many distinct dates does this phone have?
    if len(set(data[data.Phone == item]['Date'])) == 1:
        data.loc[data.Phone == item, 'type_user'] = 'new'
    elif len(set(data[data.Phone == item]['Date'])) > 1:
        data.loc[data.Phone == item, 'type_user'] = 'recurrent'
    else:
        data.loc[data.Phone == item, 'type_user'] = np.nan

You can use an LOD (Level of Detail) expression to do that. The calculation below gives you how many records exist for each (Phone, Date) pair:
{Fixed [Phone],[Date]: SUM([Number of Records])}
If you want a text label instead, do:
IF {Fixed [Phone],[Date]: SUM([Number of Records])} > 1 THEN 'recurrent' ELSE 'new' END

Thanks for your reply! It didn't exactly solve my problem, but it definitely helped me find the solution.
The solution:
I first got, for a given phone, the number of distinct dates:
{FIXED [Phone] : COUNTD([Date])}
Then I created my categorical (dimension) variable:
IF {FIXED [Phone] : COUNTD([Date])} > 1 THEN 'recurrent' ELSE 'new' END
The result matched the expected table above (screenshot omitted; phone numbers are hidden for data privacy reasons).
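As an aside for the pandas side of the question, the original loop can be replaced by a grouped distinct count, the direct analogue of the FIXED expression above. A minimal sketch, assuming the sample data from the question:

import numpy as np
import pandas as pd

data = pd.DataFrame({'Phone': list('aaabbccc'),
                     'Date':  [1, 1, 2, 2, 2, 3, 2, 1]})

# number of distinct dates per phone, broadcast back onto every row
distinct_dates = data.groupby('Phone')['Date'].transform('nunique')
data['User_type'] = np.where(distinct_dates > 1, 'recurrent', 'new')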

Related

How to filter a dataset based on keys from another dataset

I have a dataset of book ratings which looks like this:
ratings.head()
   User-ID        ISBN  Book-Rating
0   276725  034545104X            0
1   276726  0155061224            5
2   276727  0446520802            0
3   276729  052165615X            3
4   276729  0521795028            6
and I want to filter the dataset down to users who liked a particular book.
I've tried:
lotr_ratings = ratings[ratings['ISBN'] == '0345339703']
liked_lotr = lotr_ratings[lotr_ratings['Book-Rating'] == 10]  # readers who like lotr
liked_lotr = liked_lotr['User-ID'].to_frame()
ratings[ratings['User-ID'] == liked_lotr]  # filter the original dataset -- this is the line that fails
It failed with:
MemoryError
Help would be appreciated. Thanks.
It looks like you just want to create a new dataframe based on multiple conditions. Do it like this:
conditions = (ratings['ISBN'] == '0345339703') & (ratings['Book-Rating'] == 10)
like_lotr = ratings[conditions]
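If the goal is then to pull every rating made by those users (which is what the last line of the question appears to attempt), comparing with == is the wrong tool; a membership test with isin avoids the error. A sketch, assuming the same column names as above:

liked_users = ratings.loc[conditions, 'User-ID'].unique()    # users who rated the book a 10
fan_ratings = ratings[ratings['User-ID'].isin(liked_users)]  # every rating by those users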

Calculate the mean value in pandas based on various features (columns)

Goal
I am writing card game analysis scripts. For convenience, the data is stored in Excel sheets, so users can type the information for each game into the sheets and use the Python script to analyze the return of the game. Three rivals are involved in each card game (four players in total), and I want to analyze the overall return against a certain player, e.g. I want to know how much my dad has won when playing cards with Tom.
Data
The Excel sheet consists of several features like "date, start_time, end_time, duration, location, Pal1, Pal2, Pal3" and a target "Return", with positive numbers as gains and negative numbers as losses. The data is read using pandas.
Problem
I could not figure out how to index a certain pal, as he or she may appear in any of the "Pal#" columns. I need to calculate the mean return when a certain pal is involved.
Excel sheet (demo): screenshot omitted.
Code
import numpy as np
import pandas as pd

path = 'excel.xlsx'
data_df = pd.read_excel(path)

def people_estimation(raw_data, name):
    # '牌友1'..'牌友3' are the Pal1..Pal3 columns; '收益' is the Return column
    data = raw_data
    df1 = data.pivot_table(columns=['牌友1'], values='收益', aggfunc=np.mean)
    df2 = data.pivot_table(columns=['牌友2'], values='收益', aggfunc=np.mean)
    df3 = data.pivot_table(columns=['牌友3'], values='收益', aggfunc=np.mean)
    interest = (df1[name] + df2[name] + df3[name]) / 3
    print("The gain with", name, "is:", interest)
Note
The code above achieves what I want, but I think there is a better way to do it. Can anyone help? Thanks in advance.
>>> a
   a  b  c
0  2  2  1
1  3  1  2
2  4  1  3
>>> mask = (a['a'] == 2) | (a['c'] == 2)
>>> mask
0     True
1     True
2    False
dtype: bool
>>> a[mask]
   a  b  c
0  2  2  1
1  3  1  2
>>> a[mask]['c']
0    1
1    2
Name: c, dtype: int64
>>> a[mask]['c'].mean()
1.5
I also think the issue in your code is that each condition used in a mask should be wrapped in parentheses. Applied to your columns:
data[(data['牌友1'] == 'Tom') | (data['牌友2'] == 'Tom') | (data['牌友3'] == 'Tom')]['收益'].mean()
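Note that this computes a single mean over all games involving the player, whereas averaging the three pivot-table means (as in the original code) weighs each column equally regardless of how often the player appears in it. Wrapped up as a drop-in replacement for people_estimation, assuming the same column names:

def people_estimation(data, name):
    # rows where `name` appears in any of the three pal columns
    mask = ((data['牌友1'] == name) |
            (data['牌友2'] == name) |
            (data['牌友3'] == name))
    return data.loc[mask, '收益'].mean()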

How to fill rows automatically in pandas, from the content found in a column?

In Python 3 and pandas I have a dataframe with dozens of columns and rows about food characteristics. Below is a summary:
alimentos = pd.read_csv("alimentos.csv", sep=',', encoding='utf-8')
alimentos.reset_index()
   index  alimento  calorias
0      0   iogurte        40
1      1  sardinha        30
2      2  manteiga        50
3      3      maçã        10
4      4     milho        10
The column "alimento" (food) has the lines "iogurte", "sardinha", "manteiga", "maçã" and "milho", which are food names.
I need to create a new column in this dataframe, which will tell what kind of food is. I gave the name "classificacao"
alimentos['classificacao'] = ""
alimentos.reset_index()
   index  alimento  calorias classificacao
0      0   iogurte        40
1      1  sardinha        30
2      2  manteiga        50
3      3      maçã        10
4      4     milho        10
Depending on the content found in the "alimento" column, I want to automatically fill the rows of the "classificacao" column.
For example, when finding "iogurte", fill in "laticinio" (dairy); when finding "sardinha", "peixe" (fish); when finding "manteiga", "gordura animal" (animal fat); when finding "maçã", "fruta" (fruit); and when finding "milho", "cereal".
Please, is there a way to automatically fill the rows when I find these strings?
If you have a mapping of all the possible values in the "alimento" column, you can just create a dictionary and use .map(d), as shown below:
df = pd.DataFrame({'alimento': ['iogurte', 'sardinha', 'manteiga', 'maçã', 'milho'],
                   'calorias': range(10, 60, 10)})
d = {"iogurte": "laticinio", "sardinha": "peixe", "manteiga": "gordura animal", "maçã": "fruta", "milho": "cereal"}
df['classificacao'] = df['alimento'].map(d)
However, in real life we often can't map everything in a dict (because of outliers that occur once in a blue moon, faulty inputs, etc.), in which case the above returns NaN in the "classificacao" column. This can cause issues downstream, so think about setting a default value, like "Other" or "Unknown". To do that, just append .fillna("Other") after .map(d).
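Putting it together, the whole assignment with a fallback label (using "Other" as the hypothetical default) would be:

df['classificacao'] = df['alimento'].map(d).fillna("Other")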

How to create a loop or function to implement this logic for a list of lists

I have a data set that looks like this:
CustomerID  EventID  EventType  EventTime
6           1        Facebook   42373.31586
6           2        Facebook   42373.316
6           3        Web        42374.32921
6           4        Twitter    42377.14913
6           5        Facebook   42377.40598
6           6        Web        42378.31245
CustomerID: the unique identifier associated with the particular customer.
EventID: a unique identifier for a particular online activity.
EventType: the type of online activity associated with this record (Web, Facebook, or Twitter).
EventTime: the date and time at which this online activity took place. This value is measured as the number of days since January 1, 1900, with fractions indicating particular times of day. So, for example, an event taking place at the stroke of midnight on January 1, 2016 would have an EventTime of 42370.00, while an event taking place at noon on January 1, 2016 would have an EventTime of 42370.50 (see the conversion sketch just below).
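As an aside, such serial day values can be converted back to timestamps in Python. A minimal sketch; note that spreadsheet-style serials effectively count from an epoch of 1899-12-30 (a quirk inherited from the fictitious leap day 1900-02-29), which is what makes 42370 land on January 1, 2016:

from datetime import datetime, timedelta

def serial_to_datetime(event_time):
    # 42370.00 -> 2016-01-01 00:00:00, 42370.50 -> 2016-01-01 12:00:00
    return datetime(1899, 12, 30) + timedelta(days=event_time)

print(serial_to_datetime(42370.5))  # 2016-01-01 12:00:00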
I've managed to import the CSV and turn it into a list of lists with the following code:
# Import libraries & set working directory
import csv

# STEP 1: READING THE DATA INTO A PYTHON LIST OF LISTS
f = open('test1000.csv', "r")  # open the CSV file
a = f.read()                   # read the file contents into one string
split_list = a.split("\r")     # split on \r line endings
split_list[0:5]                # view the first few rows

# Convert from a list of strings to a list of lists
final_list = []
for row in split_list:
    row_fields = row.split(',')  # split each row on the comma delimiter
    final_list.append(row_fields)
print(final_list[0:5])

# CREATING INITIAL BLANK LISTS FOR OUTPUTTING DATA
legit = []
fraud = []
What I need to do next is sort each record into the fraud or legit list of lists. A record is considered fraudulent under the following condition, in which case it goes to the fraud list.
Logic to assign a row to the fraud list: the same CustomerID performs the same EventType within 4 hours of a previous occurrence.
For example, row 2 (event 2) in the sample data set above would be moved to the fraud list because event 1 (also Facebook) happened within the previous 4 hours. On the other hand, event 4 would go to the legit list because there are no Twitter records for that customer in the previous 4 hours.
The data set is in chronological order.
This solution groups by CustomerID and EventType and then checks whether the previous event of the same type occurred less than (lt) 4 hours ago (4. / 24 of a day):
df['possible_fraud'] = (
    df.groupby(['CustomerID', 'EventType'])
      .EventTime
      .transform(lambda group: group - group.shift())
      .lt(4. / 24))
>>> df
   CustomerID  EventID EventType    EventTime  possible_fraud
0           6        1  Facebook  42373.31586           False
1           6        2  Facebook  42373.31600            True
2           6        3       Web  42374.32921           False
3           6        4   Twitter  42377.14913           False
4           6        5  Facebook  42377.40598           False
5           6        6       Web  42378.31245           False
>>> df[df.possible_fraud]
   CustomerID  EventID EventType  EventTime  possible_fraud
1           6        2  Facebook  42373.316            True
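A small aside: the transform with a lambda can also be written with the grouped diff, which should be equivalent here:

df['possible_fraud'] = (
    df.groupby(['CustomerID', 'EventType'])
      .EventTime
      .diff()
      .lt(4. / 24))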
Of course, the pandas-based solution looks smarter, but here is an example using just a plain dictionary.
P.S. I'll leave reading the input and writing the output to you.
#!/usr/bin/python2.7
sample = """
6 1 Facebook 42373.31586
6 2 Facebook 42373.316
6 3 Web 42374.32921
6 4 Twitter 42377.14913
5 5 Web 42377.3541
6 6 Facebook 42377.40598
6 7 Web 42378.31245
"""
last = {}  # maps a client ID to the most recent time of each event type, e.g.:
           # {"6": {"Facebook": 42373.31586, "Web": 42374.32921}}
legit = []
fraud = []

for row in sample.split('\n')[1:-1]:
    Cid, Eid, Type, Time = row.split()
    if Cid not in last:
        # first event ever seen for this client
        legit.append(row)
        last[Cid] = {Type: Time}
        row += '\tlegit'
    elif Type not in last[Cid]:
        # first event of this type for this client
        legit.append(row)
        last[Cid][Type] = Time
        row += '\tlegit'
    elif float(Time) - float(last[Cid][Type]) > (4. / 24):
        # same event type, but more than 4 hours after the previous one
        legit.append(row)
        last[Cid][Type] = Time
        row += '\tlegit'
    else:
        fraud.append(row)
        row += '\tfraud'
    print row

How to divide a dbf table into two or more dbf tables using Python

I have a dbf table. I want to automatically divide this table into two or more tables using Python. The main problem is that the table consists of several groups of rows, and each group is separated from the previous one by an empty row, so I need to save each group to a new dbf table. I think this could be solved with some function from the arcpy package together with FOR and WHILE loops, but I can't work it out. My source dbf table is more complex, but I attach a simple example for better understanding.
Source dbf table:
ID NAME TEAM
1 A 1
2 B 2
3 C 1
4
5 D 2
6 E 3
I want to get dbf1:
ID NAME TEAM
1 A 1
2 B 2
3 C 1
I want to get dbf2:
ID NAME TEAM
1 D 2
2 E 3
Using my dbf package it could look something like this (untested):
import dbf

source_dbf = '/path/to/big/dbf_file.dbf'
base_name = '/path/to/smaller/dbf_%03d'

sdbf = dbf.Table(source_dbf)
i = 1
ddbf = sdbf.new(base_name % i)
sdbf.open()
ddbf.open()
for record in sdbf:
    if not record.name:  # assuming if 'name' is empty, all are empty
        # blank row: close the current destination table and start the next one
        ddbf.close()
        i += 1
        ddbf = sdbf.new(base_name % i)
        ddbf.open()  # the new table must be opened before records can be appended
        continue
    ddbf.append(record)
ddbf.close()
sdbf.close()
