Complex Transformation of GB to TB in mixed column using Python

Complex Transformation of GB to TB in mixed column using Python - python

I have a dataframe, df, that looks like this:
TB TB2
4.6 5.0
6.8 502.4 G
My desired output is to convert any value that has the letter G behind it to TB, without disturbing the other values within that column.
1000 Gigabytes = 1 Terabyte
TB TB2
4.6 5.0
6.8 0.5024
A member has suggested the following code:
df['Partial_Capacity TB']=df['Partial_Capacity TB'].str.replace('\s\w+','').astype(float).div(1000)
However, all of the values within the column are being converted, regardless of the 'G' behind it.
I am working on this now, any suggestions are appreciated

#Select rows containing G
m=df.TB2.str.contains('G')
#Use the loc accessor to mask relevant column, strip G from string and divide by 1000
df.loc[m,'TB2']=df.loc[m,'TB2'].str.replace('\s\w+','').astype(float).div(1000)
print(df)
TB TB2
0 4.6 5.0
1 6.8 0.5024

Related

Pandas groupby - dataframe's column disappearing

I have the following data frame called "new_df":
dato uttak annlegg Merd ID Leng BW CF F B H K
0 2020-12-15 12_20 LL 3 1 48.0 1200 1.085069 0.0 2.0 0.0 NaN
1 2020-12-15 12_20 LL 3 2 43.0 830 1.043933 0.0 1.0 0.0 NaN
columns are:
'dato', 'uttak', 'annlegg', 'Merd', 'ID', 'Leng', 'BW', 'CF', 'F', 'B', 'H', 'K'
when I do:
new_df.groupby(['annlegg','Merd'],as_index=False).mean()
I got all means except the column "BW" like this:
annlegg Merd ID Leng CF F B H K
0 KH 1 42.557143 56.398649 1.265812 0.071770 1.010638 0.600000 0.127907
1 KH 2 42.683794 56.492228 1.270522 0.021978 0.739130 0.230769 0.075862
2 KH 3 42.177866 35.490119 1.125416 0.000000 0.384146 0.333333 0.034483
Column "BW" just disappeared when I groupby, no matter "as_index" True or False, why is that?

It appears the content as the BW column does not have a numerical type but an object type instead, which is used for storing strings for instance. Thus when applying groupby and meanaggregation function, tour column disappears has computing the mean value of an object (think of a string does not make sense in general).
You should start by converting your BW column :
First method : pd.to_numeric
This first method will safely convert all your column to float objects.
new_df['BW'] = pd.to_numeric(new_df['BW'])
Second method : df.astype
If you do not want to convert your data to float (for instance, you know that this column only contains int, or if floating point precision does not interest you), you can use the astype method which allows you to convert to almost any type you want :
new_df['BW'] = new_df['BW'].astype(float) # Converts to float
new_df['BW'] = new_df['BW'].astype(int) # Converts to integer
You can eventually apply your groupby and aggregation as you did !

That's probably due to the wrong data type. You can try this.
new_df = new_df.convert_dtypes()
new_df.groupby(['annlegg','Merd'],as_index=False).mean()
You can check dtype via:
new_df.dtype

You can try .agg() function to target specific columns.
new_df.groupby(['annlegg','Merd']).agg({'BW':'mean'})

Pandas join.fillna of two data frames replaces all all values with anf not only nan

The following code will update the number of items in stock based on the index. The table dr with the old stock holds >1000 values. The updated data frame grp1 contains the number of sold items. I would like to subtract data frame grp1 from data frame dr and update dr. Everything is fine until I want to join grp1 to dr with Panda's join and fillna. First of all datatypes are changed from int to float and not only the NaN but also the notnull values are replaced by 0. Is this a problem with not matching indices?
I tried to make the dtypes uniform but this has not changed anything. Removing fillna while joining the two dataframes returns NaN for all columns.
dr has the following format (example):
druck_pseudonym lager_nr menge_im_lager
80009359 62808 1
80009360 62809 10
80009095 62810 0
80009364 62811 11
80009365 62812 10
80008572 62814 10
80009072 62816 18
80009064 62817 13
80009061 62818 2
80008725 62819 3
80008940 62820 12
dr.dtypes
lager_nr int64
menge_im_lager int64
dtype: object
and grp1 (example):
LagerArtikelNummer1 ArtMengen1
880211066 1
80211070 1
80211072 2
80211073 2
80211082 2
80211087 4
80211091 1
80211107 2
88889272 1
88889396 1
ArtMengen1 int64
dtype: object
#update list with "nicht_erledigt"
dr_update = dr.join(grp1).fillna(0)
dr_update["menge_im_lager"] = dr_update["menge_im_lager"] - dr_update["ArtMengen1"]
This returns:
lager_nr menge_im_lager ArtMengen1
druck_pseudonym
80009185 44402 26.0 0.0
80009184 44403 2.0 0.0
80009182 44405 16.0 0.0
80008894 44406 32.0 0.0
80008115 44407 3.0 0.0
80008974 44409 16.0 0.0
80008380 44411 4.0 0.0
dr_update.dtypes
lager_nr int64
menge_im_lager float64
ArtMengen1 float64
dtype: object

Editing after comment, indices are object.
Your indices are string objects. You need to convert these to numeric. Use
dr.index = pd.to_numeric(dr.index)
grp1.index = pd.to_numeric(grp1.index)
dr.sort_index()
grp1.sort_index()
Then try the rest...
You can filter the old stock 'dr' dataframe to match the sold stock, then substract, and assing back to the original filtered dataframe.
# Filter the old stock dataframe so that you have matching index to the sold dataframe.
# Restrict just for menge_im_lager. Then subtract the sold stock
dr.loc[dr.index.isin(grp1.index), "menge_im_lager"] = (
dr.loc[dr.index.isin(grp1.index), "menge_im_lager"] - grp1["ArtMengen1"]
)

If I understand correctly, firstly you want the non-matching indices to be in your final dataset and you want your final dataset to be integers. You can use 'outer' join and astype int for your dataset.
So, at the join you can do it this way:
dr.join(grp1,how='outer').fillna(0).astype(int)

Create multiple dataframe columns containing calculated values from an existing column

I have a dataframe, sega_df:
Month 2016-11-01 2016-12-01
Character
Sonic 12.0 3.0
Shadow 5.0 23.0
I would like to create multiple new columns, by applying a formula for each already existing column within my dataframe (to put it shortly, pretty much double the number of columns). That formula is (100 - [5*eachcell])*0.2.
For example, for November for Sonic, (100-[5*12.0])*0.2 = 8.0, and December for Sonic, (100-[5*3.0])*0.2 = 17.0 My ideal output is:
Month 2016-11-01 2016-12-01 Weighted_2016-11-01 Weighted_2016-12-01
Character
Sonic 12.0 3.0 8.0 17.0
Shadow 5.0 23.0 15.0 -3.0
I know how to create a for loop to create one column. This is for if only one month was in consideration:
for w in range(1,len(sega_df.index)):
sega_df['Weighted'] = (100 - 5*sega_df)*0.2
sega_df[sega_df < 0] = 0
I haven't gotten the skills or experience yet to create multiple columns. I've looked for other questions that may answer what exactly I am doing but haven't gotten anything to work yet. Thanks in advance.

One vectorised approach is to drown to numpy:
A = sega_df.values
A = (100 - 5*A) * 0.2
res = pd.DataFrame(A, index=sega_df.index, columns=('Weighted_'+sega_df.columns))
Then join the result to your original dataframe:
sega_df = sega_df.join(res)

Data calculation in pandas python

I have:
A1 A2 Random data Random data2 Average Stddev
0 0.1 2.0 300 3000 1.05 1.343503
1 0.5 4.5 4500 450 2.50 2.828427
2 3.0 1.2 800 80 2.10 1.272792
3 9.0 9.0 900 90 9.00 0.000000
And would like to add a column 'ColumnX' that needs to have the values calculated as :
ColumnX = min(df['Random data']-df['Average'],df[Random data2]-
df[Stddev])/3.0*df['A2'])
I get the error:
ValueError: The truth value of a Series is ambiguous.

Your error has to do with pandas preferring bitwise operators and using the built in min function isn't going to work row wise.
A potential solution would be to make two new calculated columns then using the pandas dataframe .min method.
df['calc_col_1'] = df['Random data']-df['Average']
df['calc_col_2'] = (df['Random data2']-df['Stddev'])/(3.0*df['A2'])
df['min_col'] = df[['calc_col_1','calc_col_2']].min(axis=1)
The method min(axis=1) will find the min between the two columns by row then assigned to the new column. This way is efficient because you're using numpy vectorization, and it is easier to read.

Rounding down values in Pandas dataframe column with NaNs

I have a Pandas dataframe that contains a column of float64 values:
tempDF = pd.DataFrame({ 'id': [12,12,12,12,45,45,45,51,51,51,51,51,51,76,76,76,91,91,91,91],
'measure': [3.2,4.2,6.8,5.6,3.1,4.8,8.8,3.0,1.9,2.1,2.4,3.5,4.2,5.2,4.3,3.6,5.2,7.1,6.5,7.3]})
I want to create a new column containing just the integer part. My first thought was to use .astype(int):
tempDF['int_measure'] = tempDF['measure'].astype(int)
This works fine but, as an extra complication, the column I have contains a missing value:
tempDF.ix[10,'measure'] = np.nan
This missing value causes the .astype(int) method to fail with:
ValueError: Cannot convert NA to integer
I thought I could round down the floats in the column of data. However, the .round(0) function will round to the nearest integer (higher or lower) rather than rounding down. I can't find a function equivalent to ".floor()" that will act on a column of a Pandas dataframe.
Any suggestions?

You could just apply numpy.floor;
import numpy as np
tempDF['int_measure'] = tempDF['measure'].apply(np.floor)
id measure int_measure
0 12 3.2 3
1 12 4.2 4
2 12 6.8 6
...
9 51 2.1 2
10 51 NaN NaN
11 51 3.5 3
...
19 91 7.3 7

You could also try:
df.apply(lambda s: s // 1)
Using np.floor is faster, however.

The answers here are pretty dated and as of pandas 0.25.2 (perhaps earlier) the error
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Which would be
df.iloc[:,0] = df.iloc[:,0].astype(int)
for one particular column.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Complex Transformation of GB to TB in mixed column using Python - python

#Select rows containing G m=df.TB2.str.contains('G') #Use the loc accessor to mask relevant column, strip G from string and divide by 1000 df.loc[m,'TB2']=df.loc[m,'TB2'].str.replace('\s\w+','').astype(float).div(1000) print(df) TB TB2 0 4.6 5.0 1 6.8 0.5024

Related

Pandas groupby - dataframe's column disappearing

Pandas join.fillna of two data frames replaces all all values with anf not only nan

Create multiple dataframe columns containing calculated values from an existing column

Data calculation in pandas python

Rounding down values in Pandas dataframe column with NaNs

Categories

Resources