I have a massive dataset and I need to subset the data by a given criterion. Here is an illustration:
| Group | Name    | Value |
|-------|---------|-------|
| A     | Bill    | 256   |
| A     | Jack    | 268   |
| A     | Melissa | 489   |
| B     | Amanda  | 787   |
| B     | Eric    | 485   |
| C     | Matt    | 1236  |
| C     | Lisa    | 1485  |
| D     | Ben     | 785   |
| D     | Andrew  | 985   |
| D     | Cathy   | 1025  |
| D     | Suzanne | 1256  |
| D     | Jim     | 1520  |
I know how to handle this problem manually, such as:
import pandas as pd

df = pd.read_csv('Test.csv')
A = df[df.Group == "A"].to_numpy()
B = df[df.Group == "B"].to_numpy()
C = df[df.Group == "C"].to_numpy()
D = df[df.Group == "D"].to_numpy()
But considering the size of the data, this approach would take a lot of time.
With that in mind, I would like to know whether it is possible to build an iteration with an IF statement that looks at the values in the "Group" column (table above). My idea is an IF statement that checks whether the first value matches one of the values below it; if so, group them together and create a new array/DataFrame for that group.
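A minimal sketch of an alternative that avoids writing one filter per group, using pandas' groupby (not from the original post; the dictionary name groups is only illustrative):
import pandas as pd

df = pd.read_csv('Test.csv')
# Build one NumPy array per distinct value of the "Group" column in a single pass
groups = {name: sub.to_numpy() for name, sub in df.groupby('Group')}
A = groups['A']  # same array as the manual df[df.Group == "A"] filter above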
I would like to create a delay matrix from a time series.
For example if
y = [y_0, y_1, y_2, ..., y_N] and W = 5
I would like to create the matrix
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | y_0 |
| 0 | 0 | 0 | y_0 | y_1 |
| ... | | | | |
| y_{N-4} | y_{N-3} | y_{N-2} | y_{N-1} | y_N |
I know that the function timeseries_dataset_from_array from TensorFlow does approximately the same thing when configured properly, but I would like to avoid using TensorFlow.
This is my current function to perform this task:
import numpy as np
from numpy import ndarray

def get_warm_up_matrix(_data: ndarray, W: int) -> ndarray:
    """
    Return a warm-up (delay) matrix.

    If _data = [y_1, y_2, ..., y_N], the output matrix will be

        +---------+-----+---------+---------+-----+
        | 0       | ... | 0       | 0       | 0   |
        | 0       | ... | 0       | 0       | y_1 |
        | 0       | ... | 0       | y_1     | y_2 |
        | ...     | ... | ...     | ...     | ... |
        | y_1     | ... | y_{W-2} | y_{W-1} | y_W |
        | ...     | ... | ...     | ...     | ... |
        | y_{N-W} | ... | y_{N-2} | y_{N-1} | y_N |
        +---------+-----+---------+---------+-----+

    :param _data: input time series of length N
    :param W: window width (number of delayed values per row)
    :return: the (N, W) warm-up matrix
    """
    N = len(_data)
    warm_up = np.zeros((N, W), dtype=_data.dtype)
    raw_data_with_zeros = np.concatenate((np.zeros(W, dtype=_data.dtype), _data), dtype=_data.dtype)
    for k in range(W, N + W):
        warm_up[k - W, :] = raw_data_with_zeros[k - W:k]
    return warm_up
It works well, but it's quite slow, since both the concatenate operation and the for loop take time. It also takes a lot of memory, since the data has to be duplicated before the matrix is filled.
I would like a faster and more memory-friendly method to perform the same task. Thanks for your help :)
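For what it's worth, one possible direction (a sketch, not from the original post): NumPy 1.20+ provides numpy.lib.stride_tricks.sliding_window_view, which returns the overlapping windows as a view rather than copying each row, so only the zero-padded buffer of length N + W is allocated. Assuming the same row layout as the loop version above:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def get_warm_up_matrix_view(data: np.ndarray, W: int) -> np.ndarray:
    # Prepend W zeros, then take the first N length-W windows as a view
    padded = np.concatenate((np.zeros(W, dtype=data.dtype), data))
    return sliding_window_view(padded, W)[:len(data)]
Note that the result is a read-only view into the padded buffer; call .copy() on it if a writable matrix is needed.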
I have a table in the shape of a symmetric matrix that tells me which components are compatible. Here is an example:
Components | A | B | C | D | E | F | G |
-----------+---+---+---+---+---+---+---+
A | | | 1 | 1 | 1 | 1 | |
-----------+---+---+---+---+---+---+---+
B | | | | | 1 | | 1 |
-----------+---+---+---+---+---+---+---+
C | 1 | | | | | 1 | |
-----------+---+---+---+---+---+---+---+
D | 1 | | | | | 1 | 1 |
-----------+---+---+---+---+---+---+---+
E | 1 | 1 | | | | | 1 |
-----------+---+---+---+---+---+---+---+
F | 1 | | 1 | 1 | | | 1 |
-----------+---+---+---+---+---+---+---+
G | | 1 | | 1 | 1 | 1 | |
-----------+---+---+---+---+---+---+---+
The 1s mark compatible pairs and the blanks mark incompatible ones. The actual table has a lot more components. The real table currently lives in an Excel spreadsheet, but it could easily be converted to CSV or text for convenience.
What I need to do is create a list of possible combinations. I know there are things like itertools, but I need it to list only the compatible combinations and ignore the incompatible ones. These go into a .dat file that I pull in when I run Pyomo:
set NODES := A B C D E F G;
param: ARCS:=
A
B
C
...
A C
A D
B E
...
A C F
B G E
...
Everything listed together must be mutually compatible. So A C F can appear together because each pair is compatible, but A D G cannot because G is not compatible with A.
Long-term plan:
Eventually I plan to use Pyomo to find the best combination of components, minimizing the resources associated with each component. Therefore the .dat file will eventually contain an additional cost associated with each combination.
import pandas as pd
import networkx as nx

# Read the compatibility matrix; blank cells come in as NaN, so treat them as 0
df = pd.read_excel(r"/path/to/file.xlsx", sheet_name="Sheet4", index_col=0, usecols="A:H").fillna(0)

# Build an undirected graph from the adjacency matrix and list every clique,
# i.e. every set of components that are all mutually compatible
G = nx.from_pandas_adjacency(df)
print(list(nx.enumerate_all_cliques(G)))
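To turn that clique list into the ARCS listing shown in the question, a small helper along these lines could work (a sketch only; G and df are the objects built above, and arcs.dat is a hypothetical output file name):
# Write each clique as one space-separated line, matching the .dat format above
with open("arcs.dat", "w") as fh:
    fh.write("set NODES := " + " ".join(str(c) for c in df.columns) + ";\n")
    fh.write("param: ARCS :=\n")
    for clique in nx.enumerate_all_cliques(G):
        fh.write(" ".join(clique) + "\n")
    fh.write(";\n")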
I'm scraping multiple sports betting websites in order to compare the odds for each match across the websites.
My question is how to identify the match_id of a match that already exists in the DB but has its team names written in a different way.
Please feel free to add any approaches even if they don't use dataframes or SQLite.
The columns of the matches table are:
match_id: int, sport: string, home_team: string, away_team: string, date: string (dd/mm/YYYY)
So for each new match I want to verify if it already exists in the DB.
New match = (sport_to_check, home_team_to_check, away_team_to_check, date_to_check)
My pseudo-code is like:
SELECT match_id FROM matches
WHERE sport = (sport_to_check)
AND date = (date_to_check)
AND (fuzz(home_team, home_team_to_check) > 80 OR fuzz(away_team, away_team_to_check) > 80) //the fuzzy scores evaluation
If no match is found the new row would be inserted.
I believe there's no way to mix Python with SQL like that, which is why I refer to it as "pseudo-code".
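(As a side note, something quite close to that pseudo-code is actually possible: sqlite3 lets you register a Python function and call it from SQL via create_function. A minimal sketch, assuming an SQLite file named matches.db and fuzzywuzzy's token_set_ratio; the names here are illustrative only:)
import sqlite3
from fuzzywuzzy import fuzz

conn = sqlite3.connect("matches.db")  # hypothetical database file
# Expose a Python fuzzy-scoring function to SQL under the name "fuzz"
conn.create_function("fuzz", 2, lambda a, b: fuzz.token_set_ratio(a, b))

row = conn.execute(
    """SELECT match_id FROM matches
       WHERE sport = ? AND date = ?
         AND (fuzz(home_team, ?) > 80 OR fuzz(away_team, ?) > 80)""",
    (sport_to_check, date_to_check, home_team_to_check, away_team_to_check),
).fetchone()
match_id = row[0] if row else None  # None means not found, so insert the new row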
I can also pull the matches table into a pandas DataFrame and do the evaluation there, if that works (how?...).
At any given time the matches table isn't expected to have more than a couple of thousand records.
Let me give you some examples of expected outputs, where the solution is represented by find(row).
Having matches table in DB as:
+----------+------------+-----------------------------+----------------------+------------+
| match_id | sport | home_team | visitor_team | date |
+----------+------------+-----------------------------+----------------------+------------+
| 84 | football | confianca | cuiaba esporte clube | 24/11/2020 |
| 209 | football | cs alagoana | operario pr | 24/11/2020 |
| 184 | football | grenoble foot 38 | as nancy lorraine | 24/11/2020 |
| 7 | football | sv turkgucu-ataspor munchen | saarbrucken | 24/11/2020 |
| 414 | handball | dinamo bucareste | usam nimes | 24/11/2020 |
| 846 | handball | benidorm | naturhouse la rioja | 25/11/2020 |
| 874 | handball | cegledi | ferencvarosi tc | 25/11/2020 |
| 418 | handball | lemvig-thyboron | kif kolding | 25/11/2020 |
| 740 | ice hockey | tps | kookoo | 25/11/2020 |
| 385 | football | stevenage | hull | 29/11/2020 |
+----------+------------+-----------------------------+----------------------+------------+
And new matches to evaluate:
+----------------+------------+---------------------+---------------------+------------+
| row (for demo) | sport | home_team | visitor_team | date |
+----------------+------------+---------------------+---------------------+------------+
| A | football | confianca-se | cuiaba mt | 24/11/2020 |
| B | football | csa | operario | 24/11/2020 |
| C | football | grenoble | nancy | 24/11/2020 |
| D | football | sv turkgucu ataspor | 1 fc saarbrucken | 24/11/2020 |
| E | handball | dinamo bucuresti | nimes | 24/11/2020 |
| F | handball | bm benidorm | bm logrono la rioja | 25/11/2020 |
| G | handball | cegledi kkse | ftc budapest | 25/11/2020 |
| H | handball | lemvig | kif kobenhavn | 25/11/2020 |
| I | ice hockey | turku ps | kookoo kouvola | 25/11/2020 |
| J | football | stevenage borough | hull city | 29/11/2020 |
| K | football | west brom | sheffield united | 28/11/2020 |
+----------------+------------+---------------------+---------------------+------------+
Outputs:
find(A) returns: 84
find(B) returns: 209
find(C) returns: 184
find(D) returns: 7
find(E) returns: 414
find(F) returns: 846
find(G) returns: 874
find(H) returns: 418
find(I) returns: 740
find(J) returns: 385
find(K) returns: (something like "not found" => I would then insert the new row)
Thanks!
Basically I filter the original table down by the given date and sport, then use fuzzywuzzy to find the best match of home and visitor teams among the remaining rows:
Setup:
import pandas as pd
cols = ['match_id','sport','home_team','visitor_team','date']
df1 = pd.DataFrame([
['84','football','confianca','cuiaba esporte clube','24/11/2020'],
['209','football','cs alagoana','operario pr','24/11/2020'],
['184','football','grenoble foot 38','as nancy lorraine','24/11/2020'],
['7','football','sv turkgucu-ataspor munchen','saarbrucken','24/11/2020'],
['414','handball','dinamo bucareste','usam nimes','24/11/2020'],
['846','handball','benidorm','naturhouse la rioja','25/11/2020'],
['874','handball','cegledi','ferencvarosi tc','25/11/2020'],
['418','handball','lemvig-thyboron','kif kolding','25/11/2020'],
['740','ice hockey','tps','kookoo','25/11/2020'],
['385','football','stevenage','hull','29/11/2020']], columns=cols)
cols = ['row','sport','home_team','visitor_team','date']
df2 = pd.DataFrame([
['A','football','confianca-se','cuiaba mt','24/11/2020'],
['B','football','csa','operario','24/11/2020'],
['C','football','grenoble','nancy','24/11/2020'],
['D','football','sv turkgucu ataspor','1 fc saarbrucken','24/11/2020'],
['E','handball','dinamo bucuresti','nimes','24/11/2020'],
['F','handball','bm benidorm','bm logrono la rioja','25/11/2020'],
['G','handball','cegledi kkse','ftc budapest','25/11/2020'],
['H','handball','lemvig','kif kobenhavn','25/11/2020'],
['I','ice hockey','turku ps','kookoo kouvola','25/11/2020'],
['J','football','stevenage borough','hull city','29/11/2020'],
['K','football','west brom','sheffield united','28/11/2020']], columns=cols)
Code:
import pandas as pd
from fuzzywuzzy import fuzz
import string
def calculate_ratio(row):
    return fuzz.token_set_ratio(row['col1'], row['col2'])

def find(df1, df2, row_search):
    alpha = df2.query('row == "{row_search}"'.format(row_search=row_search))
    sport = alpha.iloc[0]['sport']
    date = alpha.iloc[0]['date']
    home_team = alpha.iloc[0]['home_team']
    visitor_team = alpha.iloc[0]['visitor_team']
    beta = df1.query('sport == "{sport}" & date == "{date}"'.format(sport=sport, date=date))
    if len(beta) == 0:
        return 'Not found.'
    else:
        temp = pd.DataFrame({'match_id': list(beta['match_id']),
                             'col1': list(beta['home_team'] + ' ' + beta['visitor_team']),
                             'col2': [home_team + ' ' + visitor_team] * len(beta)})
        temp['score'] = temp.apply(calculate_ratio, axis=1)
        temp = temp.sort_values('score', ascending=False)
        outcome = temp.head(1).iloc[0]['match_id']
        return outcome

for row_alpha in string.ascii_uppercase[0:11]:
    outcome = find(df1, df2, row_alpha)
    print('{row_alpha} --> {outcome}'.format(row_alpha=row_alpha, outcome=outcome))
Output:
A --> 84
B --> 209
C --> 184
D --> 7
E --> 414
F --> 846
G --> 874
H --> 418
I --> 740
J --> 385
K --> Not found.
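One possible refinement (not in the code above): find() always returns the best-scoring match_id, even when that best score is low. To mirror the > 80 threshold from the pseudo-code in the question, the score could be checked before returning, for example:
best = temp.sort_values('score', ascending=False).head(1).iloc[0]
# Treat weak matches as "Not found." so the caller inserts a new row instead
outcome = best['match_id'] if best['score'] > 80 else 'Not found.'
The exact cutoff may need tuning, since heavy abbreviations (e.g. "tps" vs "turku ps") can score below 80.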
I am conducting a data analysis in both R and Python to compare their differences. Currently I am struggling to translate
data %>%
mutate(pct_leader = ballotsLeader/validBallots) %>%
group_by(community) %>%
mutate(mean_pct_leader = mean(pct_leader),
sd_pct_leader = sd(pct_leader),
up_pct_leader = mean_pct_leader+2*sd_pct_leader) %>%
filter(pct_leader > up_pct_leader) %>%
top_n(5, pct_leader)
into Python.
I have tried the following python code
grouped = data.assign(pct_leader = lambda x: x['ballotsLeader']/x['validBallots']).groupby('community').assign(mean_pct_leader = lambda x: mean(x['pct_leader']),
sd_pct_leader = lambda x: stdev(x['pct_leader']),
up_pct_leader = lambda x: x['mean_pct_leader']+2*x['sd_pct_leader']).query('pct_leader > up_pct_leader').pct_leader.nlargest(5)
but get an AttributeError: 'DataFrameGroupBy' object has no attribute 'assign'.
I realize this is because the DataFrameGroupBy object does not have an assign method.
How can I preserve the order of operations of the R code while translating it into Python?
Edit: Here is the data I am working with
| community | province | municipality | precinct | registeredVoters | emptyBallots | invalidBallots | validBallots | ballotsLeader |
|-----------|-----------|--------------|----------|------------------|--------------|----------------|--------------|---------------|
| GALICIA | Coruña, A | Ames | 001 B | 270 | 3 | 7 | 206 | 129 |
| GALICIA | Coruña, A | Ames | 004 A | 356 | 2 | 7 | 257 | 136 |
| GALICIA | Coruña, A | Ames | 002 C | 296 | 1 | 2 | 214 | 149 |
| GALICIA | Coruña, A | Ames | 010 U | 646 | 15 | 10 | 507 | 189 |
| GALICIA | Coruña, A | Ames | 012 B | 695 | 6 | 8 | 479 | 247 |
Without seeing some data it's hard to be certain, but something like this should work. Using groupby(...).transform broadcasts the per-group mean and standard deviation back onto every row, so pct_leader is still available for the row-level filter:
(data.assign(pct_leader=data['ballotsLeader'] / data['validBallots'])
     .assign(mean_pct_leader=lambda d: d.groupby('community')['pct_leader'].transform('mean'),
             sd_pct_leader=lambda d: d.groupby('community')['pct_leader'].transform('std'))
     .assign(up_pct_leader=lambda d: d['mean_pct_leader'] + 2 * d['sd_pct_leader'])
     .query('pct_leader > up_pct_leader')
     .nlargest(5, 'pct_leader')
)
Using datar, it will be easy for you to translate your R code into python:
from datar.all import f, mutate, group_by, mean, sd, filter, slice_max
data >> \
mutate(pct_leader = f.ballotsLeader/f.validBallots) >> \
group_by(f.community) >> \
mutate(mean_pct_leader = mean(f.pct_leader),
sd_pct_leader = sd(f.pct_leader),
up_pct_leader = f.mean_pct_leader+2*f.sd_pct_leader) >> \
filter(f.pct_leader > f.up_pct_leader) >> \
slice_max(f.pct_leader, n=5)
# top_n() has been superseded in favour of slice_min()/slice_max()
I am the author of the package. Feel free to submit issues if you have any questions.