Pandas groupby - dataframe's column disappearing - python

I have the following data frame called "new_df":
dato uttak annlegg Merd ID Leng BW CF F B H K
0 2020-12-15 12_20 LL 3 1 48.0 1200 1.085069 0.0 2.0 0.0 NaN
1 2020-12-15 12_20 LL 3 2 43.0 830 1.043933 0.0 1.0 0.0 NaN
columns are:
'dato', 'uttak', 'annlegg', 'Merd', 'ID', 'Leng', 'BW', 'CF', 'F', 'B', 'H', 'K'
when I do:
new_df.groupby(['annlegg','Merd'],as_index=False).mean()
I get all the means except the column "BW", like this:
annlegg Merd ID Leng CF F B H K
0 KH 1 42.557143 56.398649 1.265812 0.071770 1.010638 0.600000 0.127907
1 KH 2 42.683794 56.492228 1.270522 0.021978 0.739130 0.230769 0.075862
2 KH 3 42.177866 35.490119 1.125416 0.000000 0.384146 0.333333 0.034483
Column "BW" just disappeared when I groupby, no matter "as_index" True or False, why is that?

It appears the content of the BW column does not have a numerical type but an object type instead, which is used for storing strings, for instance. Thus, when applying groupby and the mean aggregation function, the column disappears, as computing the mean value of an object (think of a string) does not make sense in general.
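A quick way to confirm this is to inspect the dtypes (a sketch using the column names from the question):
print(new_df.dtypes)          # 'BW' shows as object instead of int64/float64
print(new_df['BW'].unique())  # look for stray strings that block the conversion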
You should start by converting your BW column :
First method : pd.to_numeric
This first method will safely convert your whole column to float values.
new_df['BW'] = pd.to_numeric(new_df['BW'])
Second method : df.astype
If you do not want to convert your data to float (for instance, if you know that this column only contains ints, or if floating-point precision does not interest you), you can use the astype method, which allows you to convert to almost any type you want:
new_df['BW'] = new_df['BW'].astype(float) # Converts to float
new_df['BW'] = new_df['BW'].astype(int) # Converts to integer
You can then apply your groupby and aggregation as you did!
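Putting it together, a minimal sketch (errors='coerce' is optional and turns unparseable entries into NaN instead of raising):
new_df['BW'] = pd.to_numeric(new_df['BW'], errors='coerce')
# on recent pandas, numeric_only=True skips the remaining text columns like 'dato' and 'uttak'
new_df.groupby(['annlegg', 'Merd'], as_index=False).mean(numeric_only=True)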

That's probably due to the wrong data type. You can try this.
new_df = new_df.convert_dtypes()
new_df.groupby(['annlegg','Merd'],as_index=False).mean()
You can check the dtypes via:
new_df.dtypes

You can try .agg() function to target specific columns.
new_df.groupby(['annlegg','Merd']).agg({'BW':'mean'})
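Assuming 'BW' has first been converted to a numeric dtype as described above, .agg also lets you aggregate several columns with different functions in one pass, for example:
new_df.groupby(['annlegg', 'Merd']).agg({'BW': 'mean', 'Leng': ['mean', 'max']})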

Related

pandas fillna sequentially step by step

I have a dataframe like the one below
Re_MC,Fi_MC,Fin_id,Res_id,
1,2,3,4
,7,6,11
11,,31,32
,,35,38
df1 = pd.read_clipboard(sep=',')
I would like to fillna in two steps:
a) First, compare only Re_MC and Fi_MC. If a value is missing in either of these columns, copy it from the other column.
b) If, despite step a, there is still an NA in Re_MC or Fi_MC, copy values from Fin_id for Fi_MC and Res_id for Re_MC.
So, I tried the below two approaches
Approach 1 - This works but is not efficient/elegant
df1['Re_MC'] = df1['Re_MC'].fillna(df1['Fi_MC'])
df1['Fi_MC'] = df1['Fi_MC'].fillna(df1['Re_MC'])
df1['Re_MC'] = df1['Re_MC'].fillna(df1['Res_id'])
df1['Fi_MC'] = df1['Fi_MC'].fillna(df1['Fin_id'])
Approach 2 - This doesn't work and produces incorrect output
df1['Re_MC'] = df1['Re_MC'].fillna(df1['Fi_MC']).fillna(df1['Res_id'])
df1['Fi_MC'] = df1['Fi_MC'].fillna(df1['Re_MC']).fillna(df1['Fin_id'])
Is there any other efficient way to fillna in a sequential manner? Meaning, we do step a first and then, based on the result of step a, we do step b.
I expect my output to be as shown below.
Updated code:
df_new = (df_new
.fillna({'Re MC': df_new['Re Cust'],'Re MC': df_new['Re Cust_System']})
.fillna({'Fi MC' : df_new['Fi.Fi Customer'],'Final MC':df_new['Re.Fi Customer']})
.fillna({'Fi MC' : df_new['Re MC']})
.fillna({'Class Fi MC':df_new['Re MC']})
)
You can use dictionaries in fillna. Because the dict values all reference the original df1 columns, each call applies one complete step at a time, so step a finishes before step b begins:
(df1
.fillna({'Re_MC': df1['Fi_MC'], 'Fi_MC': df1['Re_MC']})
.fillna({'Re_MC': df1['Res_id'], 'Fi_MC': df1['Fin_id']})
)
output:
Re_MC Fi_MC Fin_id Res_id
0 1.0 2.0 3 4
1 7.0 7.0 6 11
2 11.0 11.0 31 32
3 38.0 35.0 35 38
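For completeness, a self-contained sketch reproducing this with the sample data from the question:
import io
import pandas as pd

data = '''Re_MC,Fi_MC,Fin_id,Res_id
1,2,3,4
,7,6,11
11,,31,32
,,35,38'''
df1 = pd.read_csv(io.StringIO(data))

out = (df1
    .fillna({'Re_MC': df1['Fi_MC'], 'Fi_MC': df1['Re_MC']})    # step a
    .fillna({'Re_MC': df1['Res_id'], 'Fi_MC': df1['Fin_id']})  # step b
)
print(out)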

Pandas join/fillna of two data frames replaces all values, and not only NaN

The following code updates the number of items in stock based on the index. The table dr with the old stock holds >1000 values. The data frame grp1 contains the number of sold items. I would like to subtract data frame grp1 from data frame dr and update dr. Everything is fine until I join grp1 to dr with pandas' join and fillna: first of all, the datatypes are changed from int to float, and not only the NaN values but also the non-null values are replaced by 0. Is this a problem with non-matching indices?
I tried to make the dtypes uniform, but this did not change anything. Removing fillna while joining the two dataframes returns NaN for all columns.
dr has the following format (example):
druck_pseudonym lager_nr menge_im_lager
80009359 62808 1
80009360 62809 10
80009095 62810 0
80009364 62811 11
80009365 62812 10
80008572 62814 10
80009072 62816 18
80009064 62817 13
80009061 62818 2
80008725 62819 3
80008940 62820 12
dr.dtypes
lager_nr int64
menge_im_lager int64
dtype: object
and grp1 (example):
LagerArtikelNummer1 ArtMengen1
880211066 1
80211070 1
80211072 2
80211073 2
80211082 2
80211087 4
80211091 1
80211107 2
88889272 1
88889396 1
grp1.dtypes
ArtMengen1 int64
dtype: object
#update list with "nicht_erledigt"
dr_update = dr.join(grp1).fillna(0)
dr_update["menge_im_lager"] = dr_update["menge_im_lager"] - dr_update["ArtMengen1"]
This returns:
lager_nr menge_im_lager ArtMengen1
druck_pseudonym
80009185 44402 26.0 0.0
80009184 44403 2.0 0.0
80009182 44405 16.0 0.0
80008894 44406 32.0 0.0
80008115 44407 3.0 0.0
80008974 44409 16.0 0.0
80008380 44411 4.0 0.0
dr_update.dtypes
lager_nr int64
menge_im_lager float64
ArtMengen1 float64
dtype: object
Edit after comment: the indices are objects.
Your indices are string objects. You need to convert them to numeric. Use
dr.index = pd.to_numeric(dr.index)
grp1.index = pd.to_numeric(grp1.index)
dr = dr.sort_index()
grp1 = grp1.sort_index()
Then try the rest...
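The underlying issue is that join aligns rows by index label, and the string '80009359' is not equal to the integer 80009359; when nothing matches, every joined cell becomes NaN, and the subsequent fillna(0) turns them all into zeros. A minimal demonstration with hypothetical data:
import pandas as pd

left = pd.DataFrame({'stock': [5]}, index=['100'])  # string index
right = pd.DataFrame({'sold': [2]}, index=[100])    # integer index
print(left.join(right))                             # sold is NaN: '100' != 100
left.index = pd.to_numeric(left.index)
print(left.join(right))                             # sold is 2 once the labels match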
You can filter the old stock 'dr' dataframe to match the sold stock, then subtract, and assign back to the original filtered dataframe.
# Filter the old stock dataframe so that its index matches the sold dataframe.
# Restrict to menge_im_lager, then subtract the sold stock.
dr.loc[dr.index.isin(grp1.index), "menge_im_lager"] = (
dr.loc[dr.index.isin(grp1.index), "menge_im_lager"] - grp1["ArtMengen1"]
)
If I understand correctly, you want the non-matching indices to be in your final dataset and you want your final dataset to contain integers. You can use an 'outer' join and astype(int) for your dataset.
So, at the join, you can do it this way:
dr.join(grp1,how='outer').fillna(0).astype(int)
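For example (a sketch, assuming both indices have already been converted to the same numeric type as described above):
dr_update = dr.join(grp1, how='outer').fillna(0).astype(int)
dr_update['menge_im_lager'] = dr_update['menge_im_lager'] - dr_update['ArtMengen1']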

Removing dash string from mixed dtype column in pandas Dataframe

I have a dataframe with objects possibly mixed with numerical values.
My target is to change every value to a simple integer; however, some of these values have '-' between the digits.
A minimal working example looks like:
import pandas as pd
d = {'API':[float(4433), float(3344), 6666, '6-9-11', '8-0-11', 9990]}
df = pd.DataFrame(d)
I try:
df['API'] = df['API'].str.replace('-','')
But this leaves me with nan for the numeric entries, because the string method only applies to the strings in the column.
The output is:
API
nan
nan
nan
6911
8011
nan
I'd like an output:
API
4433
3344
6666
6911
8011
9990
Where all types are int.
Is there an easy way to take care of just the object types in the Series while leaving the actual numericals intact? I'm using this technique on large data sets (300,000+ lines), so something like lambda or series operations would be preferred over a loop.
Use df.replace with regex=True
df = df.replace('-', '', regex=True).astype(int)
API
0 4433
1 3344
2 6666
3 6911
4 8011
5 9990
Also,
df['API'] = df['API'].astype(str).apply(lambda x: x.replace('-', '')).astype(float).astype(int)
(The intermediate astype(float) is needed because astype(str) turns a float like 4433.0 into the string '4433.0', which cannot be cast to int directly.)
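Another option (a sketch) is to go through strings once and let pd.to_numeric do the final cast:
df['API'] = pd.to_numeric(df['API'].astype(str).str.replace('-', '', regex=False)).astype(int)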

Convert DataFrame with 'N/As' to float to compute percent change

I am trying to convert the following DataFrame (which contains several 'N/As') to float so that I can perform a percent change operation:
d = pd.DataFrame({"A":['N/A','$10.00', '$5.00'],
"B":['N/A', '$10.00', '-$5.00']})
Ultimately, I would like the result to be:
(UPDATE: I do not want to remove the original N/A values. I'd like to keep them there as placeholders.)
Because there aren't any flags for dealing with negative numbers, I cannot use:
pct_change(-1)
So, I need to use:
d['A'].diff(-1)/d['A'].shift(-1).abs()
But, I get the error:
TypeError: unsupported operand type(s) for -: 'str' and 'str'
For a first step, I am trying to convert the data from object/string to float, but the output is unexpected (to me): I am getting float 'NaNs' instead of the actual numbers.
>d['A_float'] = pd.to_numeric(d['A'], errors='coerce')
>d
A B A_float
0 N/A N/A NaN
1 $10.00 $10.00 NaN
2 $5.00 -$5.00 NaN
>d.dtypes
A object
B object
A_float float64
dtype: object
As a simple test, I tried subtracting '1' from the value, but still got float 'NaN'.
>d['A_float_minus1_test'] = pd.to_numeric(d['A'], errors='coerce')-1
>d
A B A_float A_float_minus1_test
0 N/A N/A NaN NaN
1 $10.00 $10.00 NaN NaN
2 $5.00 -$5.00 NaN NaN
>d.dtypes
A object
B object
A_float float64
A_float_minus1_test float64
dtype: object
Is there a simple way to get the following result? The way I am thinking of is to individually change each DataFrame column to float and then perform the operation. There must be an easier way.
Desired output:
(UPDATE: I do not want to remove the original N/A values. I'd like to keep them there as placeholders.)
Thanks!
To convert your columns from string to float, you can use apply, like so:
import numpy as np

d['A_float'] = d['A'].apply(lambda x: float(x.replace('$', '')) if '$' in x else np.nan)
The x.replace('$', '') removes the $ character while keeping a possible leading minus sign; entries without a $ (such as 'N/A') are mapped to NaN so they stay as placeholders.
Then I am not sure what you are trying to do, but if you are trying to compute the percentage of A from B (assuming B has been converted to a B_float column in the same way), you can use np.vectorize like this:
def percent(p1, p2):
    return (100 * p2) / p1

d['Percent'] = np.vectorize(percent)(d['A_float'], d['B_float'])
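Note that np.vectorize is essentially a Python-level loop rather than true vectorization; once the columns are numeric, plain column arithmetic does the same thing faster (a sketch, using the hypothetical A_float/B_float columns from above):
d['Percent'] = (100 * d['B_float']) / d['A_float']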
import pandas as pd
d = pd.DataFrame({"A":['N/A','$10.00', '$5.00'],
"B":['N/A', '$10.00', '-$5.00']})
# Convert to number: remove '$' (regex=False so it is treated literally, not as a regex anchor), assign to new columns
d[['dA','dB']] = d[['A','B']].apply(lambda s: s.str.replace('$', '', regex=False)).apply(pd.to_numeric, errors='coerce')
# Perform calculations across desired column
d[['dA','dB']] = d[['dA','dB']].diff(-1)/d[['dA','dB']].shift(-1).abs()
print(d)
A B dA dB
0 N/A N/A NaN NaN
1 $10.00 $10.00 1.0 3.0
2 $5.00 -$5.00 NaN NaN

Rounding down values in Pandas dataframe column with NaNs

I have a Pandas dataframe that contains a column of float64 values:
tempDF = pd.DataFrame({ 'id': [12,12,12,12,45,45,45,51,51,51,51,51,51,76,76,76,91,91,91,91],
'measure': [3.2,4.2,6.8,5.6,3.1,4.8,8.8,3.0,1.9,2.1,2.4,3.5,4.2,5.2,4.3,3.6,5.2,7.1,6.5,7.3]})
I want to create a new column containing just the integer part. My first thought was to use .astype(int):
tempDF['int_measure'] = tempDF['measure'].astype(int)
This works fine but, as an extra complication, the column I have contains a missing value:
tempDF.loc[10,'measure'] = np.nan  # .ix is deprecated; .loc does the same here
This missing value causes the .astype(int) method to fail with:
ValueError: Cannot convert NA to integer
I thought I could round down the floats in the column of data. However, the .round(0) function will round to the nearest integer (higher or lower) rather than rounding down. I can't find a function equivalent to ".floor()" that will act on a column of a Pandas dataframe.
Any suggestions?
You could just apply numpy.floor:
import numpy as np
tempDF['int_measure'] = tempDF['measure'].apply(np.floor)
id measure int_measure
0 12 3.2 3.0
1 12 4.2 4.0
2 12 6.8 6.0
...
9 51 2.1 2.0
10 51 NaN NaN
11 51 3.5 3.0
...
19 91 7.3 7.0
(The result stays float64, since NaN cannot be stored in a plain integer column.)
You could also try:
df.apply(lambda s: s // 1)
Using np.floor is faster, however.
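If you want actual integers while keeping the missing value visible, a possible sketch (assuming a recent pandas version with the nullable Int64 dtype):
tempDF['int_measure'] = np.floor(tempDF['measure']).astype('Int64')  # NaN becomes <NA>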
The answers here are pretty dated; as of pandas 0.25.2 (and perhaps earlier), the patterns above can trigger the warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
A formulation that avoids this would be
df.iloc[:,0] = df.iloc[:,0].astype(int)
for one particular column.
