Calculate the mean value in pandas based on various features (columns) - python

Goal
I am writing a card game analysis script. For convenience, the data is stored in Excel sheets, so users can type the information for each game into the sheets and use the Python script to analyze the return of the game. 3 rivals are involved in each card game (4 people in total), and I want to analyze the overall return vs. a certain player, e.g. I want to know how much my dad has won when playing cards with Tom.
Data
The Excel sheet consists of several features like "date, start_time, end_time, duration, location, Pal1, Pal2, Pal3" and a target "Return", with positive numbers as gains and negative numbers as losses. The data is read using pandas.
Problem
I haven't figured out how to index a certain pal, as he/she may appear in any one of the "Pal#" columns. I need to calculate the mean value of the return when a certain pal is involved.
Excel sheet (demo)
Code
import numpy as np
import pandas as pd

path = 'excel.xlsx'
data_df = pd.read_excel(path)

def people_estimation(raw_data, name):
    data = raw_data
    # '牌友1'..'牌友3' are the pal columns; '收益' is the return column
    df1 = data.pivot_table(columns=['牌友1'], values='收益', aggfunc=np.mean)
    df2 = data.pivot_table(columns=['牌友2'], values='收益', aggfunc=np.mean)
    df3 = data.pivot_table(columns=['牌友3'], values='收益', aggfunc=np.mean)
    interest = (df1[name] + df2[name] + df3[name]) / 3
    print("The gain with", name, "is:", interest)
Note
The code above achieves what I want, but I think there is a better way to do it. Can anyone help? Thanks in advance.

>>> a
   a  b  c
0  2  2  1
1  3  1  2
2  4  1  3
>>> mask = (a['a'] == 2) | (a['c'] == 2)
>>> mask
0     True
1     True
2    False
dtype: bool
>>> a[mask]
   a  b  c
0  2  2  1
1  3  1  2
>>> a[mask]['c']
0    1
1    2
Name: c, dtype: int64
>>> a[mask]['c'].mean()
1.5
Note that in your code each condition for a mask should be wrapped in parentheses:
data[(data['牌友1'] == 'Tom') | (data['牌友2'] == 'Tom') | (data['牌友3'] == 'Tom')]['收益'].mean()
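If the set of pal columns ever grows, the same idea can be wrapped in a small helper. Here is a minimal sketch (the function name and defaults are mine, not from the question) that builds the mask over all pal columns at once with eq(...).any(axis=1):

import pandas as pd

def mean_return_with(data, name, pal_cols=('牌友1', '牌友2', '牌友3'), value_col='收益'):
    """Mean return over all games in which `name` appears in any pal column."""
    # True for every row where any of the pal columns equals `name`
    mask = data[list(pal_cols)].eq(name).any(axis=1)
    return data.loc[mask, value_col].mean()

# e.g. mean_return_with(data_df, 'Tom')

Unlike averaging the three per-column means, this weights every game equally, which is usually what "mean return when Tom is involved" means.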

Related

Efficiently concatenate/append dataframe in a for loop to get a single big dataframe using python pandas

Using this logic: I am reading multiple PDF files that have certain highlighted portions (presume these are tables). After pushing the highlights to a list, I save them to a dataframe. Here's the logic:
import glob

import fitz  # PyMuPDF
import pandas as pd

try:
    filepath = [file for file in glob.glob("Folder/*.pdf")]
    for file in filepath:
        doc = fitz.open(file)
        print(file)
        highlights = []
        for page in doc:
            highlights += handle_page(page)  # extracts the highlighted text from a page
        # print(highlights)
        highlights_alt = highlights[0].split(',')
        df = pd.DataFrame(highlights_alt, columns=['Security Name'])
        # print(df.columns.tolist())
        # split the trailing weight off each security name
        df[['Security Name', 'Weights']] = df['Security Name'].str.rsplit(n=1, expand=True)
        df.drop_duplicates(keep='first', inplace=True)
        print(df.head())
        print(df.shape)
except IndexError:
    print('file {} is not highlighted'.format(file))
Using this logic I get the dataframes; however, if the folder has 5 PDFs then this logic creates 5 different dataframes, something like this:
Folder\z.pdf
Security Name Weights
0 UTILITIES (5.96
1 %*) None
(2, 2)
Folder\y.pdf
Security Name Weights
0 Quantity/ Market Value % of Net Investments Cu... 1.125
1 % 01
2 /07 None
3 /2027 None
4 EUR 230
(192, 2)
Folder\x.pdf
Security Name Weights
0 Holding £740
1 000 None
2 Leeds Building Society 3.75
3 % variable 25
4 /4 None
(526, 2)
However, I want a single dataframe containing all of the above records, with shape (720, 2), something like:
Security Name Weights
0 Holding £740
1 000 None
2 Leeds Building Society 3.75
3 % variable 25
4 /4 None
.
.
720 xyz 3.33
(720, 2)
I tried using pandas's concat & append but have been unsuccessful so far. Please let me know an efficient way of doing this, since in future there will be thousands of PDFs.
Please help!
A quick way is to use pd.concat:
big_df = pd.concat(list_of_dfs, axis=0)
If this creates an error it would be helpful to know what the error is.
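For the loop in the question, that means collecting each per-file dataframe in a list instead of discarding it. A minimal sketch of the adjusted loop (handle_page and the folder path are from the question above):

import glob

import fitz  # PyMuPDF
import pandas as pd

all_dfs = []
for file in glob.glob("Folder/*.pdf"):
    doc = fitz.open(file)
    highlights = []
    for page in doc:
        highlights += handle_page(page)  # handle_page is defined in the question
    df = pd.DataFrame(highlights[0].split(','), columns=['Security Name'])
    df[['Security Name', 'Weights']] = df['Security Name'].str.rsplit(n=1, expand=True)
    df.drop_duplicates(keep='first', inplace=True)
    all_dfs.append(df)  # collect instead of overwriting df on each iteration

big_df = pd.concat(all_dfs, axis=0, ignore_index=True)
print(big_df.shape)  # one combined frame, e.g. (720, 2) for the three files above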

How to find duplicated values in Tableau

I need to create a new column that advises whether a customer is new or recurrent.
To do so I want to check, for each unique value in Phone, whether there are one or more dates associated with it in the Date column.
Phone Date
0 a 1
1 a 1
2 a 2
3 b 2
4 b 2
5 c 3
6 c 2
7 c 1
New users are those for whom there is only one unique (Phone, Date) pair with the same phone. The result that I want looks like:
Phone Date User_type
0 a 1 recurrent
1 a 1 recurrent
2 a 2 recurrent
3 b 2 new
4 b 2 new
5 c 3 recurrent
6 c 2 recurrent
7 c 1 recurrent
I managed to do it in a few lines of Python, but my boss insists that I do it in Tableau.
I know I need to use a calculated field, but that's it.
If it helps, here is my Python code that does the same:
import numpy as np
import pandas as pd

for item in set(data.Phone):
    if len(set(data[data.Phone == item]['Date'])) == 1:
        data.loc[data.Phone == item, 'type_user'] = 'new'
    elif len(set(data[data.Phone == item]['Date'])) > 1:  # fixed the 'tata' typo
        data.loc[data.Phone == item, 'type_user'] = 'recurrent'
    else:
        data.loc[data.Phone == item, 'type_user'] = np.nan
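(As an aside, the loop above can be vectorized in pandas. A minimal sketch using groupby(...).transform('nunique'), which labels a phone 'recurrent' when it has more than one distinct date:

import numpy as np
import pandas as pd

data = pd.DataFrame({'Phone': list('aaabbccc'),
                     'Date': [1, 1, 2, 2, 2, 3, 2, 1]})
# number of distinct dates per phone, broadcast back to every row
distinct_dates = data.groupby('Phone')['Date'].transform('nunique')
data['type_user'] = np.where(distinct_dates > 1, 'recurrent', 'new')

This reproduces the expected output above: 'a' and 'c' become 'recurrent', 'b' becomes 'new'.)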
You can use a LOD (Level of Detail) expression to do that; the expression below gives you how many records are duplicated:
{Fixed [Phone],[Date]: SUM([Number of Records])}
If you want a text label, do:
IF {Fixed [Phone],[Date]: SUM([Number of Records])} > 1 THEN 'recurrent' ELSE 'new' END
Thanks for your reply! It didn't exactly solve my problem, but it definitely helped me find the solution.
The solution:
First I got, for a given phone, the number of distinct dates (COUNTD, since a plain COUNT would also count duplicate dates):
{Fixed [Phone] : COUNTD([Date])}
Then I created my categorical (dimension) variable:
IF {Fixed [Phone] : COUNTD([Date])} > 1 THEN 'recurrent' ELSE 'new' END
The result worked as expected (screenshot omitted; phone numbers are hidden for data privacy reasons).

New column based on a certain input parameter to select what columns to use - Python

I have a pandas dataframe that includes multiple columns of monthly finance data. There is an input period that is specified by the person running the program; it's currently just saved as period, as shown below within the code.
# coded into Python
period = ??  # the user adds this from an input screen
I need to create another column of data that uses the input period number to perform a calculation over other columns.
So, in the above table I'd like to create a new column 'calculation' that depends on the period input. For example, if a period of 1 was used, calc1 would be computed (with the math actually done); period = 2, then calc2; period = 3, then calc3. I only need one column calculated, depending on the period number, but the picture below shows three examples of how it would work.
I can do this in SQL using CASE WHEN, using the input period to decide which columns to sum:
select Account #,
       '&Period' AS Period,
       '&Year' AS YR,
       case
           when '&Period' = '1' then sum(d_cf + d_1)
           when '&Period' = '2' then sum(d_cf + d_1 + d_2)
           when '&Period' = '3' then sum(d_cf + d_1 + d_2 + d_3)
       end
I am unsure how to do this easily in Python (I'm a newer learner). Yes, I could create a column for every possible period (1-12) and then select only the one I need, but I'd like to learn a more efficient way.
Can you help, or point me in a better direction?
You could certainly do something like
df[['d_cf'] + [f'd_{i}' for i in range(1, period+1)]].sum(axis=1)
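For completeness, a small sketch of how that expression could be assigned to the new column (the sample data is made up; it assumes the d_* columns are numeric):

import pandas as pd

df = pd.DataFrame({'d_cf': [1, 2], 'd_1': [10, 20],
                   'd_2': [100, 200], 'd_3': [1000, 2000]})
period = 2  # hypothetical user input
df['calculation'] = df[['d_cf'] + [f'd_{i}' for i in range(1, period + 1)]].sum(axis=1)
# with period = 2 this sums d_cf + d_1 + d_2, giving [111, 222]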
You can do this using a simple function in Python:
def get_calculation(df, period=None):
    '''
    df = pandas dataframe
    period = integer
    '''
    if period == 1:
        return df.apply(lambda x: x['d_0'] + x['d_1'], axis=1)
    if period == 2:
        return df.apply(lambda x: x['d_0'] + x['d_1'] + x['d_2'], axis=1)
    if period == 3:
        return df.apply(lambda x: x['d_0'] + x['d_1'] + x['d_2'] + x['d_3'], axis=1)

new_df = get_calculation(df, period=1)
Setup:
import pandas as pd

df = pd.DataFrame({'d_0': list(range(1, 7)),
                   'd_1': list(range(10, 70, 10)),
                   'd_2': list(range(100, 700, 100)),
                   'd_3': list(range(1000, 7000, 1000))})
Setup:
import pandas as pd

ddict = {
    'Year': ['2018', '2018', '2018', '2018', '2018'],
    'Account_Num': ['1111', '1122', '1133', '1144', '1155'],
    'd_cf': ['1', '2', '3', '4', '5'],
}
data = pd.DataFrame(ddict)
Create value calculator:
def get_calcs(period):
    # Convert the period to a string
    s = str(period)
    # Convert the period to an integer and add one
    n = int(period) + 1
    # Repeat each digit of the period (period + 1) times
    return ''.join([i * n for i in s])
The main function copies the data frame, iterates through the period values, and sets each calculated value at the correct index in the relevant column:
def process_data(data_frame=data, period_column='d_cf'):
    # Copy the data_frame argument
    df = data_frame.copy(deep=True)
    # Run through each value in our period column
    for i in df[period_column].values.tolist():
        # Name a new column after the period number
        new_column = 'd_{}'.format(i)
        # Pass the period into our calculator and capture the result
        calculated_value = get_calcs(i)
        # Create the new column, empty by default
        df[new_column] = ''
        # Use indexing to place the calculated value into the matching row
        df.loc[df[period_column] == i, new_column] = calculated_value
    # Return the result
    return df
Start:
Year Account_Num d_cf
0 2018 1111 1
1 2018 1122 2
2 2018 1133 3
3 2018 1144 4
4 2018 1155 5
Result:
process_data(data)
   Year Account_Num d_cf d_1  d_2   d_3    d_4     d_5
0  2018        1111    1  11
1  2018        1122    2      222
2  2018        1133    3           3333
3  2018        1144    4                 44444
4  2018        1155    5                        555555

Pandas merge - combination of AND and OR conditions [duplicate]

Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I am having problems with and what I have tried already.
I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators; however, I need to make use of LESS THAN and MORE THAN.
The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:
SNP DataFrame (snp_df):
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 752721
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.
Gene DataFrame (gene_df):
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
This dataframe shows the locations of all the genes of interest.
What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.
If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.
I have tried to tackle this by looking at either chromosomes or genes separately. There are 22 different chromosome values (ints 1-22), as the sex chromosomes are not used. Both methods take an extremely long time. One uses the pandasql module, while the other loops through the separate genes.
SQL method
import pandas as pd
import pandasql as psql

pysqldf = lambda q: psql.sqldf(q, globals())

q = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""

all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    genic_snps = pysqldf(q)
    all_dfs.append(genic_snps)

all_genic_snps = pd.concat(all_dfs)
Gene iteration method
all_dfs = []
for line in gene_df.iterrows():
    info = line[1]  # Getting the Series object
    this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
                          (snp_df['BP'] >= info['chr_start']) &
                          (snp_df['BP'] <= info['chr_stop'])]
    if this_snp.shape[0] != 0:
        this_snp = this_snp[['SNP']]
        this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
        all_dfs.append(this_snp)

all_genic_snps = pd.concat(all_dfs)
Can anyone give any suggestions of a more effective way of doing this?
I've just thought of a way to solve this, by combining my two methods:
First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This doesn't have to use any SQL queries either. I've also included a section that immediately discards any redundant genes that can't contain any SNPs in their range. This uses a double for-loop, which I normally try to avoid, but in this case it works quite well.
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    # Getting rid of redundant genes
    min_bp = this_chr_snp['BP'].min()
    max_bp = this_chr_snp['BP'].max()
    this_genes = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
                                ~(this_genes['chr_stop'] <= min_bp)]
    for line in this_genes.iterrows():
        info = line[1]
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                                    (this_chr_snp['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp = this_snp[['SNP']]
            this_snp.insert(1, 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)

all_genic_snps = pd.concat(all_dfs)
While this doesn't run spectacularly quickly, it does run, so I can actually get some answers. I'd still like to know if anyone has tips to make it more efficient, though.
You can use the following to accomplish what you're looking for:
merged_df = snp_df.merge(gene_df, on=['chromosome'], how='inner')
merged_df = merged_df[(merged_df.BP >= merged_df.chr_start) &
                      (merged_df.BP <= merged_df.chr_stop)][['SNP', 'feature_id']]
Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:
snp_df
Out[193]:
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 30400
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
gene_df
Out[194]:
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
merged_df
Out[195]:
SNP feature_id
8 rs3131972 GeneID:100302278
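If the intermediate cross join from the merge above gets too large in memory (6795021 SNPs times tens of thousands of genes per chromosome), an interval lookup is one alternative. This is a hedged sketch using pd.IntervalIndex; note that get_indexer only works when the gene regions within a chromosome do not overlap, which may not hold for real annotation data:

import pandas as pd

all_dfs = []
for chromosome, snps in snp_df.groupby('chromosome'):
    genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    intervals = pd.IntervalIndex.from_arrays(genes['chr_start'],
                                             genes['chr_stop'], closed='both')
    pos = intervals.get_indexer(snps['BP'])  # -1 where a SNP falls in no gene
    hits = pos >= 0
    matched = snps.loc[hits, ['SNP']].copy()
    matched['feature_id'] = genes['feature_id'].to_numpy()[pos[hits]]
    all_dfs.append(matched)
all_genic_snps = pd.concat(all_dfs, ignore_index=True)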

How to fill rows automatically in pandas, from the content found in a column?

In Python 3 and pandas I have a dataframe with dozens of columns and rows about food characteristics. Below is a sample:
import pandas as pd

alimentos = pd.read_csv("alimentos.csv", sep=',', encoding='utf-8')
alimentos.reset_index()
index alimento calorias
0 0 iogurte 40
1 1 sardinha 30
2 2 manteiga 50
3 3 maçã 10
4 4 milho 10
The column "alimento" (food) has the values "iogurte" (yogurt), "sardinha" (sardine), "manteiga" (butter), "maçã" (apple) and "milho" (corn), which are food names.
I need to create a new column in this dataframe that will say what kind of food each one is. I named it "classificacao":
alimentos['classificacao'] = ""
alimentos.reset_index()
index alimento calorias classificacao
0 0 iogurte 40
1 1 sardinha 30
2 2 manteiga 50
3 3 maçã 10
4 4 milho 10
Depending on the content found in the "alimento" column, I want to automatically fill the rows of the "classificacao" column.
For example, when finding "iogurte", fill in "laticinio" (dairy); for "sardinha", "peixe" (fish); for "manteiga", "gordura animal" (animal fat); for "maçã", "fruta" (fruit); and for "milho", "cereal".
Please, is there a way to automatically fill the rows when these strings are found?
If you have a mapping of all the possible values in the "alimento" column, you can just create a dictionary and use .map(d), as shown below:
df = pd.DataFrame({'alimento': ['iogurte', 'sardinha', 'manteiga', 'maçã', 'milho'],
                   'calorias': range(10, 60, 10)})
d = {"iogurte":"laticinio", "sardinha":"peixe", "manteiga":"gordura animal", "maçã":"fruta", "milho": "cereal"}
df['classificacao'] = df['alimento'].map(d)
However, in real life we often can't map everything in a dict (because of outliers that occur once in a blue moon, faulty inputs, etc.), in which case the above would return NaN in the "classificacao" column. This could cause some issues, so think about setting a default value, like "Other" or "Unknown". To do that, just append .fillna("Other") after .map(d).
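For example, a one-line version combining both steps (building on the df and d defined above):

df['classificacao'] = df['alimento'].map(d).fillna("Other")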
