I have been playing around with a football dataset and need to group the rows by their ['position'] values, assigning each group to a new variable.
First, here is my dataframe:
df = player_stats[['id','player','date','team_name','fixture_name','position','shots', 'shots_on_target', 'xg',
'xa', 'attacking_pen_area_touches', 'penalty_area_entry_passes',
'carries_total_distance', 'total_distance_passed', 'aerial_sucess_perc',
'passes_attempted', 'passes_completed', 'short_pass_accuracy_perc', 'medium_pass_accuracy_perc',
'long_pass_accuracy_perc', 'final_third_entry_passes', 'carries_total_distance', 'ball_recoveries',
'total_distance_passed', 'dribbles_completed', 'dribbles_attempted', 'touches',
'tackles_won', 'tackles_attempted']]
I have split my ['position'] as it had multiple string values, and added the results to a column called ['position_new']. Here are its value counts:
position_new
AM 277
CB 938
CM 534
DF 7
DM 604
FW 766
GK 389
LB 296
LM 149
LW 284
MF 5
RB 300
RM 160
RW 323
WB 275
What I need is basically three different variables that all have the same columns, but are separated by the value in position_new. Look at the scheme below:
So: my variable att needs to have all the columns of df, but only the rows where position_new is equal to FW, LW, or RW.
I know how to hardcode it, but cannot get my head around how to transform it into a for loop.
Here is my loop:
for col in df[29:30]:
    if df.loc[df['position_new'] == 'FW', 'LW', 'RW']:
        att = df
    elif df.loc[df['position_new'] == 'AM', 'CM', 'DM', 'LM', 'RM']:
        mid = df
    else:
        defender = df
Thank you!
I'm not sure what you are trying to do but it looks like you want to get all positions that are of type attackers, midfielders, and defenders based on their two-letter abbreviation into separate variables.
What you are doing is not optimal because it won't work on any generic data frame with this type of info.
But if you want to do it for just this case, you don't need the loop at all; you are simply missing .isin, the multi-value comparison. Filter with boolean indexing directly:
att = df.loc[df['position_new'].isin(['FW', 'LW', 'RW'])]
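A sketch of all three groups in one go (the defender list is an assumption covering the remaining codes from your value counts; whether GK belongs with the defenders is your call):

groups = {
    'att': ['FW', 'LW', 'RW'],
    'mid': ['AM', 'CM', 'DM', 'LM', 'RM'],
    'defender': ['CB', 'DF', 'LB', 'RB', 'WB', 'GK'],  # assumption: the leftover codes
}
frames = {name: df.loc[df['position_new'].isin(codes)] for name, codes in groups.items()}
att, mid, defender = frames['att'], frames['mid'], frames['defender']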
I have a dataset where one of the columns is populated with strings containing a few pieces of information, some of which I would like to extract into a different column.
Currently, I have something like this:
name                         price
A CERUMEN UNIDOSE B/10 AFR   125
ACARILBIAL SOL EXT FL/200ML  8569
ACCULOL 0.5% CY FL/5ML       563
ACEFLAMEX 100MG CP B/20      12563
ACFOL 5MG COMP B/25          896
What I would like is a separate column for the medications that include some measurement in the name (that is, 200ML or 100MG), and a missing value if they don't. All the measurements in the dataset are either mg, ml, or g.
Ideally, something like this:
name                         price  measurement
A CERUMEN UNIDOSE B/10 AFR   125    "nan"
ACARILBIAL SOL EXT FL/200ML  8569   "200ML"
ACCULOL 0.5% CY FL/5ML       563    "5ML"
ACEFLAMEX 100MG CP B/20      12563  "100MG"
ACFOL 5MG COMP B/25          896    "5MG"
I tried turning the whole name column into a list of lists and transferring only the words that end in ml, mg, and g into a separate list, but then I could not match it back to the data frame.
How should I go about doing this?
Thanks for the help!
You can apply a regex to the column "name".
import re
import pandas as pd
df = pd.DataFrame({'name': ['A CERUMEN UNIDOSE B/10 AFR', 'ACARILBIAL SOL EXT FL/200ML', 'ACCULOL 0.5% CY FL/5ML', 'ACEFLAMEX 100MG CP B/20', 'ACFOL 5MG COMP B/25'],
'price': [125, 8569, 563, 12563, 896]})
df['measurement'] = df['name'].apply(lambda x: re.search(r'([\d]+)(?:MG|ML)', x).group()
if re.search(r'([\d]+)(?:MG|ML)', x) else "nan")
Output:

name                         price  measurement
A CERUMEN UNIDOSE B/10 AFR   125    nan
ACARILBIAL SOL EXT FL/200ML  8569   200ML
ACCULOL 0.5% CY FL/5ML       563    5ML
ACEFLAMEX 100MG CP B/20      12563  100MG
ACFOL 5MG COMP B/25          896    5MG
You can also check pandas string functions such as str.extract and str.replace.
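For example, str.extract is vectorized and returns a real NaN (rather than the string "nan") where nothing matches. A sketch, assuming the units are always uppercase as in the sample, with the bare G unit the question mentions added:

df['measurement'] = df['name'].str.extract(r'(\d+(?:MG|ML|G))\b', expand=False)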
I am trying to implement a logic where I have two dataframes, say A (left) and B (right).
I need to find the matching rows of A in B (I understand this can be done via an "inner" join). But my use case says I only need all rows from dataframe A, plus one column from the matched record in B, the ID column (I understand this can be done via select). Here the problem arises: after the inner join, the returned dataframe also has rows from dataframe B, which I don't need; but if I take a leftsemi join, then I can't fetch the ID column from dataframe B.
For example:
def fetch_match_condition():
    return ["EMAIL","PHONE1"]
match_condition = fetch_match_condition()
from pyspark.sql.functions import sha2, concat_ws
schema_col = ["FNAME","LNAME","EMAIL","PHONE1","ADD1"]
schema_col_master = schema_col.copy()
schema_col_master.append("ID")
data_member = [["ALAN","AARDVARK","lbrennan.kei#malchikzer.gq","7346281938","4176 BARRINGTON CT"],
["DAVID","DINGO","ocoucoumamai#hatberkshire.com","7362537492","2622 MONROE AVE"],
["MICHAL","MARESOVA","annasimkova#chello.cz","435261789","FRANTISKA CERNEHO 61 623 99 UTERY"],
["JAKUB","FILIP","qcermak#email.czanek","8653827956","KOHOUTOVYCH 93 602 42 MLADA BOLESLAV"]]
master_df = spark.createDataFrame(data_member,schema_col)
master_df = master_df.withColumn("ID", sha2(concat_ws("||", *master_df.columns), 256))
test_member = [["ALAN","AARDVARK","lbren.kei#malchik.gq","7346281938","4176 BARRINGTON CT"],
["DAVID","DINGO","ocoucoumamai#hatberkshire.com","99997492","2622 MONROE AVE"],
["MICHAL","MARESOVA","annasimkova#chello.cz","435261789","FRANTISKA CERNEHO 61 623 99 UTERY"],
["JAKUB","FILIP","qcermak#email.czanek","87463829","KOHOUTOVYCH 93 602 42 MLADA BOLESLAV"]]
test_member_1 = [["DAVID","DINGO","ocoucoumamai#hatberkshire.com","7362537492","2622 MONROE AVE"],
["MICHAL","MARESOVA","annasimkova#chello.cz","435261789","FRANTISKA CERNEHO 61 623 99 UTERY"]]
test_df = spark.createDataFrame(test_member,schema_col)
test_df_1 = spark.createDataFrame(test_member_1,schema_col)
matched_df = test_df.join(master_df, match_condition, "inner").select(test_df["*"], master_df["ID"])
master_df = master_df.union(matched_df.select(schema_col_master))
# Here only the second-to-last record will match and get added back to the master, with the same ID it had in master_df.
matched_df = test_df_1.join(master_df, match_condition, "inner").select(test_df_1["*"], master_df["ID"])
# Here the problem arises: I need only 2 matches, since test_df_1 has only two records, but I get 3 because the MICHAL record now appears twice in master_df after the union above.
PS: Is there a way I can take a leftsemi join and then use withColumn with a UDF that fetches the ID from dataframe B based on the row values of the leftsemi result?
Can anyone propose a solution for this?
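One possible approach (a sketch over the dataframes above, not a definitive fix): use a left join so every row of test_df_1 survives exactly once, and deduplicate master_df on the match keys first so the duplicated key cannot fan out the join:

# keep one (EMAIL, PHONE1) -> ID mapping per key before joining
master_keys = master_df.select(*match_condition, "ID").dropDuplicates(match_condition)
matched_df = test_df_1.join(master_keys, match_condition, "left")
# unmatched rows carry a null ID; filter with matched_df.ID.isNotNull() if needed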
Actually I am stuck at an issue. My data is in the format shown in the image below:
[image: Splitting data to multiple column]
Is there any way in a Python dataframe to segregate this data into multiple columns? For example:
[image: Data in required format]
Can anyone help me out?
I have tried to split it, but it does not work:
df3=df.technische_daten.str.split('\s+(?=\<\/[a-z]+\>)', expand=True)
df3[0]=df3[0].str.replace(r'<li>', '', regex=False)
df3[1]=df3[1].str.replace(r'</li> ','',regex=True)
Data Snippet:
<ul><li>Höhe: 248 mm</li><li>Länge: 297 mm</li><li>Breite: 246 mm</li><li>Gewicht: 4,0 kg</li><li>Leerlaufdrehzahl: 5500 U/min</li><li>Sägeblattdurchmesser: 190 mm</li><li>Leistungsaufnahme: 1400 Watt</li><li>Standard: 821552-6,B-02939,195837-9,164095-8</li><li>Bohrung: 30 mm</li><li>Schnittleistung 45°: 48,5 mm</li><li>Vibration Sägen Holz: 2,5 m/s²</li><li>Schnittleistung 0°: 67 mm</li><li>Sägeblatt-Ø / Bohrung: 190/30 mm</li><li>Max. Schnitttiefe 90°: 67 mm</li><li>Schnittleistung 0°/45°: 67/48,5 mm</li></ul>
Helpfully, pandas has a built-in function, to_html, that will make an HTML table for you.
import pandas as pd
df_rows = []
# put the below in a for loop to get all of your rows
# rows = all_your_data
# for row in rows:
# remove this line and use the above for loop
row = "<ul><li>Höhe: 248 mm</li><li>Länge: 297 mm</li><li>Breite: 246 mm</li><li>Gewicht: 4,0 kg</li><li>Leerlaufdrehzahl: 5500 U/min</li><li>Sägeblattdurchmesser: 190 mm</li><li>Leistungsaufnahme: 1400 Watt</li><li>Standard: 821552-6,B-02939,195837-9,164095-8</li><li>Bohrung: 30 mm</li><li>Schnittleistung 45°: 48,5 mm</li><li>Vibration Sägen Holz: 2,5 m/s²</li><li>Schnittleistung 0°: 67 mm</li><li>Sägeblatt-Ø / Bohrung: 190/30 mm</li><li>Max. Schnitttiefe 90°: 67 mm</li><li>Schnittleistung 0°/45°: 67/48,5 mm</li></ul>"
values = row.split("</li><li>")
# clean the data
values[0] = values[0].replace("<ul><li>", "")
values[-1] = values[-1].replace("</li></ul>", "")
dict_of_values = {}
for value in values:
    dict_of_values[value.split(": ")[0]] = value.split(": ")[1]
df_rows.append(dict_of_values)
# outside of for loop
df = pd.DataFrame.from_dict(df_rows, orient='columns')
# use df.drop to remove any columns you do not need
df = df.drop(['Leerlaufdrehzahl', 'Sägeblattdurchmesser'], axis=1)
your_html = df.to_html()
Hopefully this helps.
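A vectorized alternative (a sketch; the column name technische_daten is taken from the question's attempt): pull the "key: value" pairs out of each <li>...</li> with a regex, then expand the pairs into one column per key.

pairs = df['technische_daten'].str.findall(r'<li>([^:<]+):\s*([^<]+)</li>')
expanded = pairs.apply(lambda kv: pd.Series(dict(kv)))  # one column per key
df3 = df.join(expanded)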
Pretty new to Python and pandas in general. I have a dataframe with two columns I want to analyze:
df2 = df1.query('debt == 1').groupby(['family_status'])['family_status'].describe()
This gives me the result of:
count unique top freq
family_status
civil partnership 388 1 civil partnership 388
divorced 85 1 divorced 85
married 931 1 married 931
unmarried 274 1 unmarried 274
widow / widower 63 1 widow / widower 63
which is a lot of the information I wanted to know. However, to do some additional analysis I'd like to be able to put these results, by 'family_status', into variables (so: civil partnership, divorced, married, etc.) or feed them into an ad-hoc function.
Edit for clarity: I want to have civil_partnership = the count (in this case 388), etc.
Unsure how to proceed.
Thanks for your time in advance,
Jared
.describe() returns a normal dataframe, so you can assign it to any variable and apply any dataframe method to it.
And to select by index use .loc (the older .ix accessor has been removed from pandas).
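For example, assuming family_status is still the index of df2 at that point:

count_married = df2.loc['married', 'count']  # 931 in the output above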
It would be easy to create a dictionary and call each family_status as a key in that dict. Since you've assigned your describe dataframe to df2:
df2.reset_index(inplace=True, drop=False)
d = {}
for status, count in zip(df2['family_status'], df2['count']):
    d[status] = count
This should result in something like
d = {
    'civil partnership': 388,
    'divorced': 85,
    'married': 931,
    'unmarried': 274,
    'widow / widower': 63,
}
Edit:
df2.count invokes the count method, so the syntax above accesses the column as df2['count'] rather than as an attribute.
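As a side note, the zip loop collapses to a single dict() call over the same two columns:

d = dict(zip(df2['family_status'], df2['count']))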
I'm new to Pandas and trying to put together training data for a neural net problem.
Essentially, I have 2 DataFrames:
One DataFrame has a column for the primary_key and 3 columns for 3 different positions (sports positions; for this example assume First Base, Second Base, Third Base if you'd like). Each position column holds the player ID for the player in that position.
On a second DataFrame, I have various statistics for each player like Height and Weight.
My ultimate goal is to add columns from the second DataFrame to the first DataFrame so that each position has the associated Height and Weight for a particular player represented as columns. Then, I'm going to export this DataFrame as a csv, arrange the columns in a particular order, and use that for my training data, where each column is a training feature and each row is a training set. I've worked out a solution, but I'm wondering if I'm doing it in the most efficient manner possible, fully utilizing Pandas functions and features.
Here's what my code looks like:
EDIT: I should point out that this is just a simplification of what my code looks like. In reality, my DataFrames are pulled from CSVs, not constructed from dictionaries I created myself.
import pandas as pd
dict_1 = {'primary_key' : ['a', 'b', 'c', 'd'],
'position_1_ID' : ['ida', 'idb', 'idc', 'idd'],
'position_2_ID' : ['ide', 'idb', 'idg', 'idd'],
'position_3_ID' : ['idg', 'idf', 'idc', 'idh']
}
dict_2 = {'position_ID' : ['ida', 'idb', 'idc', 'idd', 'ide', 'idf', 'idg', 'idh'],
'Height' : ['70', '71', '72', '73', '74', '75', '76', '77'],
'Weight' : ['200', '201', '202', '203', '204', '205', '206', '207']
}
positions = pd.DataFrame(dict_1)
players = pd.DataFrame(dict_2)
position_columns = ['position_1_ID', 'position_2_ID', 'position_3_ID']
carry = positions
previous = None
for p in position_columns:
    merged = carry.merge(right=players, left_on=p, right_on='position_ID', suffixes=[previous, p])
    carry = merged
    previous = p
carry.to_csv()
After this code runs, I have a DataFrame which contains the following columns:
'primary_key'
'position_1_ID'
'position_2_ID'
'position_3_ID'
'position_IDposition_1_ID'
'position_IDposition_2_ID'
'position_IDposition_3_ID'
'Heightposition_1_ID'
'Weightposition_1_ID'
'Heightposition_2_ID'
'Weightposition_2_ID'
'Heightposition_3_ID'
'Weightposition_3_ID'
It's not pretty, but this gives me the ability to eventually export a csv with a particular column order, and it doesn't take a prohibitively long time to produce the DataFrame.
That being said, I'm doing this project partially to learn Pandas. I would like to see if there are cleaner ways to do this.
Thanks!
You can use melt, merge and unstack:
df_out = carry.melt('primary_key')\
.merge(players, left_on='value', right_on='position_ID')\
.set_index(['primary_key','variable'])\
.drop('value', axis=1)\
.unstack()
df_out.columns = [f'{i}{j}' if i != 'position_ID' else f'{i}' for i,j in df_out.columns]
print(df_out)
Output:
position_ID position_ID position_ID Heightposition_1_ID Heightposition_2_ID Heightposition_3_ID Weightposition_1_ID Weightposition_2_ID Weightposition_3_ID
primary_key
a ida ide idg 70 74 76 200 204 206
b idb idb idf 71 71 75 201 201 205
c idc idg idc 72 76 72 202 206 202
d idd idd idh 73 73 77 203 203 207
height_dict = {k:v for k, v in zip(dict_2['position_ID'], dict_2['Height'])}
weight_dict = {k:v for k, v in zip(dict_2['position_ID'], dict_2['Weight'])}
positions = pd.DataFrame(dict_1)
positions['p1_height'] = positions['position_1_ID'].map(height_dict)
Repeat similar steps for all three IDs, for both height and weight. You can loop instead of writing the repeated steps, then write the result with positions.to_csv(); see the sketch below.
Hope this helps.
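A sketch of that loop (column names follow dict_1 and the map call above; the CSV filename is hypothetical):

for i in (1, 2, 3):
    col = f'position_{i}_ID'
    positions[f'p{i}_height'] = positions[col].map(height_dict)
    positions[f'p{i}_weight'] = positions[col].map(weight_dict)
positions.to_csv('training_data.csv', index=False)  # hypothetical filename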