How to stack a number of rows into one row and assign an id - python

I have a dataframe like this:
band mean raster
1 894.343482 D:/Python/Copied/selection/20170219_095504.tif
2 1159.282304 D:/Python/Copied/selection/20170219_095504.tif
3 1342.291595 D:/Python/Copied/selection/20170219_095504.tif
4 3056.809463 D:/Python/Copied/selection/20170219_095504.tif
1 516.9624071 D:/Python/Copied/selection/20170325_095551.tif
2 720.1932533 D:/Python/Copied/selection/20170325_095551.tif
3 689.6287879 D:/Python/Copied/selection/20170325_095551.tif
4 4561.576329 D:/Python/Copied/selection/20170325_095551.tif
1 566.2016867 D:/Python/Copied/selection/20170527_095700.tif
2 812.9927101 D:/Python/Copied/selection/20170527_095700.tif
3 760.4621212 D:/Python/Copied/selection/20170527_095700.tif
4 5009.537164 D:/Python/Copied/selection/20170527_095700.tif
And I want to format it to this:
band1_mean band2_mean band3_mean band4_mean raster_name id
894.343482 1159.282304 1342.291595 3056.809463 20170219_095504.tif 1
516.9624071 720.1932533 689.6287879 4561.576329 20170325_095551.tif 2
566.2016867 812.9927101 760.4621212 5009.537164 20170527_095700.tif 3
All 4 bands belong to one raster, so the values all have to end up in one row. I don't know how to stack them without having a key id for every raster.
Thanks!

This is a case of pivot:
# extract the raster name:
df['raster_name'] = df.raster.str.extract(r'(\d+_\d+\.tif)')
# pivot
new_df = df.pivot(index='raster_name', columns='band', values='mean')
# rename the columns:
new_df.columns = [f'band{i}_mean' for i in new_df.columns]
Output:
band1_mean band2_mean band3_mean band4_mean
raster_name
20170219_095504.tif 894.343482 1159.282304 1342.291595 3056.809463
20170325_095551.tif 516.962407 720.193253 689.628788 4561.576329
20170527_095700.tif 566.201687 812.992710 760.462121 5009.537164
You can reset_index on new_df if you want raster_name to be a normal column.
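For completeness, a minimal sketch of that step; the id column is an assumption about how the ids in the desired output are generated, here just a 1-based counter over the rasters:
new_df = new_df.reset_index()
new_df['id'] = range(1, len(new_df) + 1)  # assumed: simple running id per raster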

With df.pivot(index="raster", columns="band", values="mean") (keyword arguments are required in recent pandas versions) you'd get
band 1 2 3 4
raster
20170219_095504.tif 894.343482 1159.282304 1342.291595 3056.809463
20170325_095551.tif 516.962407 720.193253 689.628788 4561.576329
20170527_095700.tif 566.201687 812.992710 760.462121 5009.537164

Related

Pandas - take multiple columns and transform them into a single column of dictionary objects?

I am trying to transform a DataFrame by combining extra columns into a dictionary.
My DataFrame will always have at least four columns: record, yhat, residual, and hat, with additional columns in different cases.
My current df head looks like this:
record yhat residual hat RinvRes AOMstat
0 1 6.7272 -0.57130 0.04985 0.009825 0.02041
1 2 6.5568 0.19460 0.09771 -0.014930 -0.03078
2 3 6.5457 0.16190 0.09765 0.272800 0.56260
If we look at the header, we see that there are 2 additional columns, RinvRes and AOMstat:
record yhat residual hat RinvRes AOMstat
0 1 6.7272 -0.57130 0.04985 0.009825 0.02041
I would like to combine those columns into a dictionary, where the column name is a key in the dictionary, e.g.:
record yhat residual hat additional
0 1 6.7272 -0.57130 0.04985 {"RinvRes": "0.2291E-01", "AOMstat": "0.3224E-01"}
In one step with .join, .agg(dict) and .drop:
First create your list of aggregate columns:
agg_cols = ['RinvRes', 'AOMstat']
# build the dict column, join it back, then drop the original columns
df1 = df.join(df[agg_cols].agg(dict, axis=1)
                          .to_frame('additional')).drop(columns=agg_cols)
print(df1)
record yhat residual hat additional
0 1 6.7272 -0.5713 0.04985 {'RinvRes': 0.009825, 'AOMstat': 0.02041}
1 2 6.5568 0.1946 0.09771 {'RinvRes': -0.01493, 'AOMstat': -0.03078}
2 3 6.5457 0.1619 0.09765 {'RinvRes': 0.2728, 'AOMstat': 0.5626}
IIUC, starting from the list of the 4 common columns, you can get the extra column names with difference and use to_dict to aggregate them:
# columns you have in common
keep_cols = ['record', 'yhat', 'residual', 'hat']
# get columns to agg into dict
extra_cols = df.columns.difference(keep_cols)
# create the result
new_df = (
    df[keep_cols]
    .assign(additional=df[extra_cols].agg(lambda x: x.to_dict(), axis=1))
)
print(new_df)
record yhat residual hat \
0 1 6.7272 -0.5713 0.04985
1 2 6.5568 0.1946 0.09771
2 3 6.5457 0.1619 0.09765
additional
0 {'AOMstat': 0.02041, 'RinvRes': 0.009825}
1 {'AOMstat': -0.03078, 'RinvRes': -0.01493}
2 {'AOMstat': 0.5626, 'RinvRes': 0.2728}
Although the above answers are more elegant and efficient, here's a simpler version:
rinvres = df['RinvRes'].values.tolist()
aomstat = df['AOMstat'].values.tolist()
df = df.drop(['RinvRes', 'AOMstat'], axis=1)  # assign the result back; drop is not in-place
additional = []
for i in range(len(rinvres)):
    add = {
        'RinvRes': rinvres[i],
        'AOMstat': aomstat[i]
    }
    additional.append(add)
df['additional'] = additional
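A compact variant of the same idea (a sketch, not taken from the answers above), assuming the same df and column names, uses to_dict('records') to build one dict per row:
agg_cols = ['RinvRes', 'AOMstat']
df['additional'] = df[agg_cols].to_dict('records')  # list of dicts, one per row
df = df.drop(columns=agg_cols)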

Add new label column based on values in dictionary in python

I am new to Python. I have a data frame with different groups and titles. Now I want to add a column (grp_pred) based on the median for each group, but I am not sure how I can accomplish this.
This is what my df looks like:
df
title M18-34 V18-34 18-34 25-54 V25-54 M25-54 18-54 V18-54 M18-54
HEPEN 0.102488 0.200995 0.312438 0.667662 0.334328 0.321393 0.739303 0.380100 0.344279
MATED 0.151090 0.208723 0.361371 0.733645 0.428349 0.280374 0.880062 0.503115 0.352025
PEERT 0.098296 0.157929 0.262779 0.624509 0.325033 0.283093 0.717562 0.384010 0.316514
RZOEK 0.143695 0.336882 0.503607 0.657216 0.414844 0.214674 0.838560 0.548663 0.255410
ERKEN 0.204918 0.409836 0.631148 0.467213 0.286885 0.163934 0.877049 0.557377 0.303279
median_dict =
{'18-34': 0.395992275,
'18-54': 0.79392129200000006,
'25-54': 0.64958055850000007,
'M18-34': 0.1171878905,
'M18-54': 0.27340067349999997,
'M25-54': 0.23422200100000001,
'V18-34': 0.2283782815,
'V18-54': 0.4497918595,
'V25-54': 0.37749252799999999}
Required output:
So basically I want to compare the median values stored in the dictionary across each title and then assign a certain group if the value is equal to that specific median, e.g. if the median is 0.395992275 then pred_grp is 18-34, and so forth.
df_out
title M18-34 V18-34 18-34 25-54 V25-54 M25-54 18-54 V18-54 M18-54 pred_grp
HEPEN 0.102488 0.200995 0.312438 0.667662 0.334328 0.321393 0.739303 0.380100 0.344279 18-54
MATED 0.151090 0.208723 0.361371 0.733645 0.428349 0.280374 0.880062 0.503115 0.352025
PEERT 0.098296 0.157929 0.262779 0.624509 0.325033 0.283093 0.717562 0.384010 0.316514
RZOEK 0.143695 0.336882 0.503607 0.657216 0.414844 0.214674 0.838560 0.548663 0.255410
ERKEN 0.204918 0.409836 0.631148 0.467213 0.286885 0.163934 0.877049 0.557377 0.303279
I would appreciate your help!
Thanks in advance.
Based on what I understood from the comments, you can try creating a df from the dictionary with the same structure as the input dataframe and then get the column with the least difference:
# put the medians in a one-row frame aligned with the value columns of df
u = df.set_index("title")
v = pd.DataFrame.from_dict(median_dict, orient='index').T.reindex(u.columns, axis=1)
# pick the column with the smallest (signed) difference for each row
df['pred_group'] = (u - v.to_numpy()).idxmin(axis=1).to_numpy()
print(df)
title M18-34 V18-34 18-34 25-54 V25-54 M25-54 \
0 HEPEN 0.102488 0.200995 0.312438 0.667662 0.334328 0.321393
1 MATED 0.151090 0.208723 0.361371 0.733645 0.428349 0.280374
2 PEERT 0.098296 0.157929 0.262779 0.624509 0.325033 0.283093
3 RZOEK 0.143695 0.336882 0.503607 0.657216 0.414844 0.214674
4 ERKEN 0.204918 0.409836 0.631148 0.467213 0.286885 0.163934
18-54 V18-54 M18-54 pred_group
0 0.739303 0.380100 0.344279 18-34
1 0.880062 0.503115 0.352025 18-34
2 0.717562 0.384010 0.316514 18-34
3 0.838560 0.548663 0.255410 M25-54
4 0.877049 0.557377 0.303279 25-54
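If "least difference" should be read as the closest value in absolute terms (an assumption about the intent, not stated above), take the absolute difference before idxmin:
df['pred_group'] = u.sub(v.to_numpy()).abs().idxmin(axis=1).to_numpy()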

How to convert a specific range of elements in a pandas DataFrame into float numbers?

I have a pandas DataFrame like the following, and this is the data:
0 1 2 3 4 5 6
0 Label Total/Target Jaccard Dice VolumeSimilarity FalseNegative FalsePositive
1 image-9003406 0.753958942196244 0.628584809743865 0.771939914928625 -0.0476974851707525 0.246041057803756 0.209200511636753
2 image-9007827 0.783266136200411 0.652181507072358 0.789479248231042 -0.015864625683349 0.216733863799589 0.204208282912204
3 image-9040390 0.797836181211824 0.611217035556112 0.758702300270988 0.0981000407411853 0.202163818788176 0.276772045623749
4 image-9047800 0.833585767007274 0.627592483537663 0.771191179469637 0.149701662401568 0.166414232992726 0.282513296651508
5 image-9054866 0.828860635279561 0.652709649240693 0.789866083907199 0.0940919370823063 0.171139364720439 0.245624253720476
6 image-9056363 0.795614053800371 0.658368025419615 0.793995078689519 0.00406974990730408 0.204385946199629 0.207617320977731
7 image-9068453 0.763313209747495 0.565848914378489 0.722737563225356 0.106314540359027 0.236686790252505 0.313742036740474
8 image-9085290 0.633747182342442 0.498166624744976 0.665035005475144 -0.0987391313269621 0.366252817657558 0.300427399066708
9 image-9087863 0.663537911271341 0.539359224086608 0.700758102003958 -0.112187081100769 0.336462088728659 0.257597937816249
10 image-9094865 0.667530629804239 0.556419610760253 0.714999485888594 -0.142222256073179 0.332469370195761 0.230263697338428
However, I need to convert the data starting from column #1 and row #1 into numbers; when it is saved to an Excel file, it is saved as strings.
How can I do that?
Your help is appreciated.
Use:
#set columns by first row
df.columns = df.iloc[0]
#set index by first column
df.index = df.iloc[:, 0]
#remove first row, first col and cast to floats
df = df.iloc[1:, 1:].astype(float)
print (df)
0 Total/Target Jaccard Dice VolumeSimilarity \
Label
image-9003406 0.753959 0.628585 0.771940 -0.047697
image-9007827 0.783266 0.652182 0.789479 -0.015865
image-9040390 0.797836 0.611217 0.758702 0.098100
image-9047800 0.833586 0.627592 0.771191 0.149702
image-9054866 0.828861 0.652710 0.789866 0.094092
image-9056363 0.795614 0.658368 0.793995 0.004070
image-9068453 0.763313 0.565849 0.722738 0.106315
image-9085290 0.633747 0.498167 0.665035 -0.098739
image-9087863 0.663538 0.539359 0.700758 -0.112187
image-9094865 0.667531 0.556420 0.714999 -0.142222
0 FalseNegative FalsePositive
Label
image-9003406 0.246041 0.209201
image-9007827 0.216734 0.204208
image-9040390 0.202164 0.276772
image-9047800 0.166414 0.282513
image-9054866 0.171139 0.245624
image-9056363 0.204386 0.207617
image-9068453 0.236687 0.313742
image-9085290 0.366253 0.300427
image-9087863 0.336462 0.257598
image-9094865 0.332469 0.230264
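An alternative sketch, assuming the same raw frame as in the question: promote the first row to the header, set the Label column as the index, and let to_numeric do the conversion.
import pandas as pd

df.columns = df.iloc[0]               # first row becomes the header
df = df.iloc[1:].set_index('Label')   # first column becomes the index
df = df.apply(pd.to_numeric)          # convert the remaining columns to numbers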

Python Pandas - Pivot a csv file into a specific format

It's my first attempt at using pandas. I really need help with pivot_table. None of the combinations I used seem to work.
I have a csv file like this:
Id Param1 Param2
1 -5.00138282776 2.04990620034E-08
1 -4.80147838593 2.01516989762E-08
1 -4.60159301758 1.98263165885E-08
1 -4.40133094788 1.94918392538E-08
1 -4.20143127441 1.91767686175E-08
1 -4.00122880936 1.88457374151E-08
2 -5.00141859055 6.88369405921E-09
2 -4.80152130127 6.77335965094E-09
2 -4.60163593292 6.65415056389E-09
2 -4.40139055252 6.54434062497E-09
3 -5.00138044357 1.16316911658E-08
3 -4.80148792267 1.15515588206E-08
3 -4.60160970688 1.14048361866E-08
3 -4.40137386322 1.12357021465E-08
3 -4.20145988464 1.11049178741E-08
I want my final output to be like this:
Param1_for_Id1 Param2_for_Id1 Param1_for_Id2 Param2_for_Id2 Param1_for_Id3 Param2_for_Id3
-5.00138282776 2.04990620034E-08 -5.00141859055 6.88369405921E-09 -5.00138044357 1.16316911658E-08
-4.80147838593 2.01516989762E-08 -4.80152130127 6.77335965094E-09 -4.80148792267 1.15515588206E-08
-4.60159301758 1.98263165885E-08 -4.60163593292 6.65415056389E-09 -4.60160970688 1.14048361866E-08
-4.40133094788 1.94918392538E-08 -4.40139055252 6.54434062497E-09 -4.40137386322 1.12357021465E-08
-4.20143127441 1.91767686175E-08 -4.20145988464 1.11049178741E-08
-4.00122880936 1.88457374151E-08
I can't figure out how to reshape my data. Any help would be most welcome!
Use set_index twice + unstack:
v = (df.set_index('Id')  # optional, omit if `Id` is the index
       .set_index(df.groupby('Id').cumcount(), append=True)
       .unstack(0)
       .sort_index(level=1, axis=1)
       .fillna('')  # I actually don't recommend adding this step
     )
v.columns = v.columns.map('{0[0]}_for_Id{0[1]}'.format)
And now,
print(v)
Param1_for_Id1 Param2_for_Id1 Param1_for_Id2 Param2_for_Id2 \
0 -5.001383 2.049906e-08 -5.00142 6.88369e-09
1 -4.801478 2.015170e-08 -4.80152 6.77336e-09
2 -4.601593 1.982632e-08 -4.60164 6.65415e-09
3 -4.401331 1.949184e-08 -4.40139 6.54434e-09
4 -4.201431 1.917677e-08
5 -4.001229 1.884574e-08
Param1_for_Id3 Param2_for_Id3
0 -5.00138 1.16317e-08
1 -4.80149 1.15516e-08
2 -4.60161 1.14048e-08
3 -4.40137 1.12357e-08
4 -4.20146 1.11049e-08
5
Note that the last fillna step results in mixed string and numeric data, so I don't recommend adding that step if you're going to do more with this output.
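Since the question mentions pivot_table: a rough sketch of the same reshape with pivot (assuming the df from the question), using a synthetic per-Id row counter as the new index:
w = (df.assign(row=df.groupby('Id').cumcount())
       .pivot(index='row', columns='Id', values=['Param1', 'Param2'])
       .sort_index(axis=1, level=1))   # group the columns by Id
w.columns = [f'{param}_for_Id{i}' for param, i in w.columns]
NaNs are left as-is here, in line with the note about fillna above.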

How to fill rows automatically in pandas, from the content found in a column?

In Python 3 and pandas, I have a dataframe with dozens of columns and lines about food characteristics. Below is a summary:
alimentos = pd.read_csv("alimentos.csv",sep=',',encoding = 'utf-8')
alimentos.reset_index()
index alimento calorias
0 0 iogurte 40
1 1 sardinha 30
2 2 manteiga 50
3 3 maçã 10
4 4 milho 10
The column "alimento" (food) has the lines "iogurte", "sardinha", "manteiga", "maçã" and "milho", which are food names.
I need to create a new column in this dataframe, which will tell what kind of food it is. I gave it the name "classificacao":
alimentos['classificacao'] = ""
alimentos.reset_index()
index alimento calorias classificacao
0 0 iogurte 40
1 1 sardinha 30
2 2 manteiga 50
3 3 maçã 10
4 4 milho 10
Depending on the content found in the "alimento" column, I want to automatically fill the rows of the "classificacao" column.
For example, when it finds "iogurte", fill in "laticinio"; "sardinha" -> "peixe"; "manteiga" -> "gordura animal"; "maçã" -> "fruta"; and "milho" -> "cereal".
Please, is there a way to automatically fill the rows when I find these strings?
If you have a mapping of all the possible values in the "alimento" column, you can just create a dictionary and use .map(d), as shown below:
df = pd.DataFrame({'alimento': ['iogurte', 'sardinha', 'manteiga', 'maçã', 'milho'],
                   'calorias': range(10, 60, 10)})
d = {"iogurte":"laticinio", "sardinha":"peixe", "manteiga":"gordura animal", "maçã":"fruta", "milho": "cereal"}
df['classificacao'] = df['alimento'].map(d)
However, in real life we often can't map everything in a dict (because of outliers that occur once in a blue moon, faulty inputs, etc.), in which case the above would return NaN in the "classificacao" column. This could cause some issues, so think about setting a default value, like "Other" or "Unknown". To do that, just append .fillna("Other") after map(d), as in the sketch below.
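A minimal sketch of that default-value variant, assuming the df and d defined above:
df['classificacao'] = df['alimento'].map(d).fillna("Other")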
