I am new to Python. I have a data frame with different groups and titles, and I want to add a column (pred_grp) based on the median for each group, but I am not sure how to accomplish this.
This is what my df looks like:
df
title M18-34 V18-34 18-34 25-54 V25-54 M25-54 18-54 V18-54 M18-54
HEPEN 0.102488 0.200995 0.312438 0.667662 0.334328 0.321393 0.739303 0.380100 0.344279
MATED 0.151090 0.208723 0.361371 0.733645 0.428349 0.280374 0.880062 0.503115 0.352025
PEERT 0.098296 0.157929 0.262779 0.624509 0.325033 0.283093 0.717562 0.384010 0.316514
RZOEK 0.143695 0.336882 0.503607 0.657216 0.414844 0.214674 0.838560 0.548663 0.255410
ERKEN 0.204918 0.409836 0.631148 0.467213 0.286885 0.163934 0.877049 0.557377 0.303279
median_dict =
{'18-34': 0.395992275,
'18-54': 0.79392129200000006,
'25-54': 0.64958055850000007,
'M18-34': 0.1171878905,
'M18-54': 0.27340067349999997,
'M25-54': 0.23422200100000001,
'V18-34': 0.2283782815,
'V18-54': 0.4497918595,
'V25-54': 0.37749252799999999}
Required output:
Basically, I want to compare each title's values against the median values stored in the dictionary and assign the title to a group if its value equals that group's median. For example, if the median is 0.395992275, then pred_grp is 18-34, and so forth.
df_out
title M18-34 V18-34 18-34 25-54 V25-54 M25-54 18-54 V18-54 M18-54 pred_grp
HEPEN 0.102488 0.200995 0.312438 0.667662 0.334328 0.321393 0.739303 0.380100 0.344279 18-54
MATED 0.151090 0.208723 0.361371 0.733645 0.428349 0.280374 0.880062 0.503115 0.352025
PEERT 0.098296 0.157929 0.262779 0.624509 0.325033 0.283093 0.717562 0.384010 0.316514
RZOEK 0.143695 0.336882 0.503607 0.657216 0.414844 0.214674 0.838560 0.548663 0.255410
ERKEN 0.204918 0.409836 0.631148 0.467213 0.286885 0.163934 0.877049 0.557377 0.303279
I would appreciate your help!
Thanks in advance
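Based on what I understood from the comments, you can build a dataframe from the dictionary with the same column structure as the input dataframe, subtract it, and take the column with the smallest difference in each row. Note that idxmin here works on the signed difference, so it picks the column that lies furthest below its median: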
u = df.set_index("title")
v = pd.DataFrame.from_dict(median_dict,orient='index').T.reindex(u.columns,axis=1)
df['pred_group'] = (u - v.to_numpy()).idxmin(axis=1).to_numpy()
print(df)
title M18-34 V18-34 18-34 25-54 V25-54 M25-54 \
0 HEPEN 0.102488 0.200995 0.312438 0.667662 0.334328 0.321393
1 MATED 0.151090 0.208723 0.361371 0.733645 0.428349 0.280374
2 PEERT 0.098296 0.157929 0.262779 0.624509 0.325033 0.283093
3 RZOEK 0.143695 0.336882 0.503607 0.657216 0.414844 0.214674
4 ERKEN 0.204918 0.409836 0.631148 0.467213 0.286885 0.163934
18-54 V18-54 M18-54 pred_group
0 0.739303 0.380100 0.344279 18-34
1 0.880062 0.503115 0.352025 18-34
2 0.717562 0.384010 0.316514 18-34
3 0.838560 0.548663 0.255410 M25-54
4 0.877049 0.557377 0.303279 25-54
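If "least difference" is meant as "closest to the median regardless of sign", one possible variant takes the absolute value before idxmin. The sketch below compares both readings on a cut-down frame; the two-column subset and the names pred_group_signed and pred_group_closest are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Cut-down version of the question's data: two titles, two groups.
df = pd.DataFrame({
    "title": ["HEPEN", "ERKEN"],
    "18-34": [0.312438, 0.631148],
    "25-54": [0.667662, 0.467213],
})
median_dict = {"18-34": 0.395992275, "25-54": 0.6495805585}

u = df.set_index("title")
# Align the medians with the column order of u.
medians = pd.Series(median_dict).reindex(u.columns)
diff = u.sub(medians, axis=1)

# Signed difference: the column furthest below its median wins.
df["pred_group_signed"] = diff.idxmin(axis=1).to_numpy()
# Absolute difference: the column closest to its median wins.
df["pred_group_closest"] = diff.abs().idxmin(axis=1).to_numpy()
print(df)
```

For HEPEN the two readings disagree (18-34 versus 25-54), so it is worth deciding which one the grouping is supposed to express.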
I have a dataframe like this:
band mean raster
1 894.343482 D:/Python/Copied/selection/20170219_095504.tif
2 1159.282304 D:/Python/Copied/selection/20170219_095504.tif
3 1342.291595 D:/Python/Copied/selection/20170219_095504.tif
4 3056.809463 D:/Python/Copied/selection/20170219_095504.tif
1 516.9624071 D:/Python/Copied/selection/20170325_095551.tif
2 720.1932533 D:/Python/Copied/selection/20170325_095551.tif
3 689.6287879 D:/Python/Copied/selection/20170325_095551.tif
4 4561.576329 D:/Python/Copied/selection/20170325_095551.tif
1 566.2016867 D:/Python/Copied/selection/20170527_095700.tif
2 812.9927101 D:/Python/Copied/selection/20170527_095700.tif
3 760.4621212 D:/Python/Copied/selection/20170527_095700.tif
4 5009.537164 D:/Python/Copied/selection/20170527_095700.tif
And I want to format it to this:
band1_mean band2_mean band3_mean band4_mean raster_name id
894.343482 1159.282304 1342.291595 3056.809463 20170219_095504.tif 1
516.9624071 720.1932533 689.6287879 4561.576329 20170325_095551.tif 2
566.2016867 812.9927101 760.4621212 5009.537164 20170527_095700.tif 3
All 4 bands belong to one raster, so their values have to be in one row. I don't know how to stack them without having a key id for every raster.
Thanks!
This is a case for pivot:
# extract the raster name (raw string, so the backslashes reach the regex engine):
df['raster_name'] = df.raster.str.extract(r'(\d+_\d+\.tif)')
# pivot
new_df = df.pivot(index='raster_name', columns='band', values='mean')
# rename the columns:
new_df.columns = [f'band{i}_mean' for i in new_df.columns]
Output:
band1_mean band2_mean band3_mean band4_mean
raster_name
20170219_095504.tif 894.343482 1159.282304 1342.291595 3056.809463
20170325_095551.tif 516.962407 720.193253 689.628788 4561.576329
20170527_095700.tif 566.201687 812.992710 760.462121 5009.537164
You can reset_index on new_df if you want raster_name to be a normal column.
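A self-contained sketch of the whole flow, including the id column the question asks for, might look like this. The data is a shortened copy of the question's frame (two rasters instead of three), and numbering the rasters from 1 in sorted order is an assumption about what id should contain.

```python
import pandas as pd

# Shortened copy of the question's data: two rasters, four bands each.
df = pd.DataFrame({
    "band": [1, 2, 3, 4, 1, 2, 3, 4],
    "mean": [894.343482, 1159.282304, 1342.291595, 3056.809463,
             516.9624071, 720.1932533, 689.6287879, 4561.576329],
    "raster": ["D:/Python/Copied/selection/20170219_095504.tif"] * 4
              + ["D:/Python/Copied/selection/20170325_095551.tif"] * 4,
})

# Keep only the file name; expand=False returns a Series.
df["raster_name"] = df["raster"].str.extract(r"(\d+_\d+\.tif)", expand=False)

# One row per raster, one column per band.
new_df = df.pivot(index="raster_name", columns="band", values="mean")
new_df.columns = [f"band{i}_mean" for i in new_df.columns]

# Turn raster_name back into a normal column and number the rasters from 1.
new_df = new_df.reset_index()
new_df["id"] = range(1, len(new_df) + 1)
print(new_df)
```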
With df.pivot(index="raster", columns="band", values="mean") you'd get:
band 1 2 3 4
raster
20170219_095504.tif 894.343482 1159.282304 1342.291595 3056.809463
20170325_095551.tif 516.962407 720.193253 689.628788 4561.576329
20170527_095700.tif 566.201687 812.992710 760.462121 5009.537164
Hello my dear coders,
I'm new to coding and I've stumbled upon a problem. I want to split a column of a CSV file that I have imported via pandas in Python. The column is named CATEGORY and contains 1, 2 or 3 values separated by commas (e.g. 2343, 3432, 4959). Now I want to split these values into separate columns named CATEGORY, SUBCATEGORY and SUBSUBCATEGORY.
I have tried this line of code:
products_combined[['CATEGORY','SUBCATEGORY', 'SUBSUBCATEGORY']] = products_combined.pop('CATEGORY').str.split(expand=True)
But I get this error: ValueError: Columns must be same length as key
Would love to hear your feedback <3
You need:
pd.DataFrame(df.CATEGORY.str.split(',').tolist(), columns=['CATEGORY','SUBCATEGORY', 'SUBSUBCATEGORY'])
Output:
CATEGORY SUBCATEGORY SUBSUBCATEGORY
0 2343 3432 4959
1 2343 3432 4959
I think this could be accomplished by first splitting the 'CATEGORY' column on commas and then creating the three columns by applying a lambda function to the resulting lists. Like so:
parts = products_combined['CATEGORY'].str.split(',')
products_combined['SUBCATEGORY'] = parts.apply(lambda original: original[1] if len(original) > 1 else None)
products_combined['SUBSUBCATEGORY'] = parts.apply(lambda original: original[2] if len(original) > 2 else None)
products_combined['CATEGORY'] = parts.apply(lambda original: original[0])
The apply() method called on a series returns a new series that contains the result of running the passed function (in this case, the lambda function) on each row of the original series.
IIUC, use split and then Series:
(
df[0].apply(lambda x: pd.Series(x.split(",")))
.rename(columns={0:"CATEGORY", 1:"SUBCATEGORY", 2:"SUBSUBCATEGORY"})
)
CATEGORY SUBCATEGORY SUBSUBCATEGORY
0 2343 3432 4959
1 1 NaN NaN
2 44 55 NaN
Data:
d = [["2343,3432,4959"],["1"],["44,55"]]
df = pd.DataFrame(d)
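The ValueError in the question most likely comes from str.split being called without a separator, so the number of resulting columns does not match the three target names. A minimal sketch that stays close to the original line, assuming the values are comma-separated without surrounding spaces (the PRODUCT column is hypothetical and only there to show that other columns survive):

```python
import pandas as pd

products_combined = pd.DataFrame({
    "PRODUCT": ["a", "b", "c"],  # hypothetical extra column
    "CATEGORY": ["2343,3432,4959", "1", "44,55"],
})

parts = (
    products_combined.pop("CATEGORY")
    .str.split(",", expand=True)   # one column per comma-separated part
    .reindex(columns=range(3))     # force exactly three columns, padding with NaN
)
parts.columns = ["CATEGORY", "SUBCATEGORY", "SUBSUBCATEGORY"]
products_combined = products_combined.join(parts)
print(products_combined)
```

The reindex step only matters when no row happens to contain all three values; without it, the assignment of three names would fail again.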
The two dataframes I am comparing are of different sizes (they have the same index, though), and I suppose that is why I am getting the error. Can you please suggest a way to get around that? I am looking for the rows in df2 whose user_id matches one in df1. Thanks, I appreciate your response.
data = np.array([['user_id','comment','label'],
[100,'RT #Dvillain_: #oomf should text me.',0],
[100,'Buy viagra',1],
[101,'#nowplaying M.C. Shan - Juice Crew Law on',0],
[101,'Buy viagra two',1]])
data2 = np.array([['user_id','comment','label'],
[100,'First comment',0],
[100,'Buy viagra',1],
[102,'Buy viagra two',1]])
df1 = pd.DataFrame(data=data[1:,0:],columns = data[0,0:])
df2 = pd.DataFrame(data=data2[1:,0:],columns = data[0,0:])
df = df2[df2['user_id'] == df1['user_id']]
You are looking for isin:
df = df2[df2['user_id'].isin(df1['user_id'])]
df
Out[814]:
user_id comment label
0 100 First comment 0
1 100 Buy viagra 1
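The == comparison fails because it compares the two Series element by element and therefore needs them to have identical labels, whereas isin only asks whether each value of one Series occurs anywhere in the other. A small sketch of that, with the frames built from dicts rather than NumPy arrays (an assumption made here so that user_id stays an integer column):

```python
import pandas as pd

df1 = pd.DataFrame({"user_id": [100, 100, 101, 101]})
df2 = pd.DataFrame({
    "user_id": [100, 100, 102],
    "comment": ["First comment", "Buy viagra", "Buy viagra two"],
    "label": [0, 1, 1],
})

# Boolean mask with one entry per row of df2, regardless of df1's length.
mask = df2["user_id"].isin(df1["user_id"])

matches = df2[mask]       # rows whose user_id also appears in df1
non_matches = df2[~mask]  # rows whose user_id does not appear in df1
print(matches)
```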
In Python 3 with pandas, I have a dataframe with dozens of columns and rows about food characteristics. Below is a summary:
alimentos = pd.read_csv("alimentos.csv",sep=',',encoding = 'utf-8')
alimentos.reset_index()
index alimento calorias
0 0 iogurte 40
1 1 sardinha 30
2 2 manteiga 50
3 3 maçã 10
4 4 milho 10
The column "alimento" (food) contains the rows "iogurte", "sardinha", "manteiga", "maçã" and "milho", which are food names.
I need to create a new column in this dataframe that tells what kind of food each one is. I named it "classificacao":
alimentos['classificacao'] = ""
alimentos.reset_index()
index alimento calorias classificacao
0 0 iogurte 40
1 1 sardinha 30
2 2 manteiga 50
3 3 maçã 10
4 4 milho 10
Depending on the content found in the "alimento" column, I want to automatically fill the rows of the "classificacao" column.
For example, "iogurte" -> "laticinio", "sardinha" -> "peixe", "manteiga" -> "gordura animal", "maçã" -> "fruta", and "milho" -> "cereal".
Is there a way to automatically fill the rows when I find these strings?
If you have a mapping of all the possible values in the "alimento" column, you can just create a dictionary and use .map(d), as shown below:
df = pd.DataFrame({'alimento': ['iogurte','sardinha', 'manteiga', 'maçã', 'milho'],
'calorias':range(10,60,10)})
d = {"iogurte":"laticinio", "sardinha":"peixe", "manteiga":"gordura animal", "maçã":"fruta", "milho": "cereal"}
df['classificacao'] = df['alimento'].map(d)
However, in real life we often can't map everything in a dict (because of outliers that occur once in a blue moon, faulty inputs, etc.), in which case the above would return NaN in the "classificacao" column for the unmapped rows. This could cause some issues, so think about setting a default value, like "Other" or "Unknown". To do that, just append .fillna("Other") after map(d).
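A short sketch of the default-value idea, using a deliberately incomplete dictionary; the "pizza" row is a hypothetical unmapped input added for illustration:

```python
import pandas as pd

df = pd.DataFrame({"alimento": ["iogurte", "sardinha", "pizza"]})  # "pizza" is not in the mapping
d = {"iogurte": "laticinio", "sardinha": "peixe"}

# map leaves NaN for keys missing from d; fillna replaces those with a default.
df["classificacao"] = df["alimento"].map(d).fillna("Other")
print(df)
```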
This is the first data frame:
Umls Snomed
C0027497/Nausea /Sign or Symptom Nausea (finding)[FN/422587007]
C0151786 / Muscle/Sign or Symptom Muscle weakness [(finding) /FN/26544005]
C2127305 /bitter/ Sign or Symptom ?
NA NA
I created a dictionary from it using the following code:
df_dic_1= df_dic_1[['UMLS', 'snomed']]
df_dic_1['UMLS'].fillna(0, inplace=True)
df_dic_1['snomed'].fillna(0, inplace=True)
equiv_snomed=df_dic_1.set_index('UMLS')['snomed'].to_dict()
Now, for data frame B:
id symptom UMLS
1 nausea C0027497/Nausea /Sign or Symptom
2 muscle C2127305 /bitter/ Sign or Symptom
3 headache
4 pain
5 bitter C2127305 /bitter/ Sign or Symptom
For any value in "UMLS" column that is available in the dictionary, I want to create another column "Snomed" that includes "snomed" values from the dictionary. So data frame C should be like this:
id symptom UMLS Snomed
1 nausea C0027497/Nausea /Sign or Symptom Nausea (finding)[FN/422]
2 muscle C0151786 / Muscle/Sign or Symptom Muscle [(fi)/FN/25]
3 headache
4 pain
5 bitter C2127305 /bitter/ Sign or Symptom ?
Any help? thanks
You could use the apply function on each element of your UMLS column and look up the value in the dictionary equiv_snomed. If the key is not in the dictionary, return np.nan.
If your data frame B is named df2, then (with numpy imported as np):
df2['Snomed'] = df2['UMLS'].apply(lambda x: equiv_snomed.get(x, np.nan))
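To make the lookup concrete, here is a self-contained sketch on a reduced version of the question's data. The dictionary holds only two of the question's entries, and the lookup relies on the UMLS strings matching the dictionary keys exactly, including whitespace; that exact match is an assumption about the real data.

```python
import numpy as np
import pandas as pd

# Two entries taken from the question's first data frame.
equiv_snomed = {
    "C0027497/Nausea /Sign or Symptom": "Nausea (finding)[FN/422587007]",
    "C2127305 /bitter/ Sign or Symptom": "?",
}

df2 = pd.DataFrame({
    "id": [1, 3, 5],
    "symptom": ["nausea", "headache", "bitter"],
    "UMLS": ["C0027497/Nausea /Sign or Symptom", np.nan,
             "C2127305 /bitter/ Sign or Symptom"],
})

# dict.get returns the default (np.nan) when the key, including a missing value, is absent.
df2["Snomed"] = df2["UMLS"].apply(lambda x: equiv_snomed.get(x, np.nan))
print(df2)
```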
See EdChum's answer to this Stack Overflow question.
As applied to your situation, it would look like:
import pandas as pd
# create dictionary
d = {'umls1':'snomed1','umls2':'snomed2','umls3':'snomed3'}
# create empty dataframe
columns = ['symptom','umls','snomed']
df = pd.DataFrame(columns = columns)
# fill it with symptoms and with umls, with some umls NULL
df['symptom'] = ['nausea','muscle','headache','pain','bitter']
# .ix was removed in pandas 1.0; .loc does label-based assignment
df.loc[0,'umls'] = 'umls1'
df.loc[1,'umls'] = 'umls2'
df.loc[4,'umls'] = 'umls3'
# add a third column with snomed values from dictionary
df['snomed'] = df['umls'].map(d)
Giving the following output:
df.head()
Out[21]:
symptom umls snomed
0 nausea umls1 snomed1
1 muscle umls2 snomed2
2 headache NaN NaN
3 pain NaN NaN
4 bitter umls3 snomed3