I am trying to use the pandas explode function to unpack a few columns that are | delimited. Within a row, every column has the same number of | separated items (e.g. A has as many as B), but rows can differ in length from one another (e.g. row 1 has three items and row 2 has two).
Some rows have a NaN here and there (e.g. in A and C), which causes the error "columns must have matching element counts".
Current data:
A          B                C
1 | 2 | 3  app | ban | cor  NaN
4 | 5      dep | exp        NaN
NaN        for | gep        NaN
Expected output:
A    B    C
1    app  NaN
2    ban  NaN
3    cor  NaN
4    dep  NaN
5    exp  NaN
NaN  for  NaN
NaN  gep  NaN
cols = ['A', 'B', 'C']
for col in cols:
    df_test[col] = df_test[col].str.split('|')
    df_test[col] = df_test[col].fillna({i: [] for i in df_test.index})  # tried replacing the NaN with an empty list, but same error
df_long = df_test.explode(cols)
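One possible fix (a sketch, not from the thread, using assumed column names and a small reproduction of the frame above): split every column, then pad each NaN cell to a list of NaNs whose length matches the longest list in that row, so every column has matching element counts before the multi-column explode (pandas >= 1.3):

import numpy as np
import pandas as pd

# small reproduction of the frame above (column names assumed)
df_test = pd.DataFrame({'A': ['1|2|3', '4|5', np.nan],
                        'B': ['app|ban|cor', 'dep|exp', 'for|gep'],
                        'C': [np.nan, np.nan, np.nan]}, dtype=object)

cols = ['A', 'B', 'C']
split = df_test[cols].apply(lambda s: s.str.split('|'))  # lists; NaN cells stay scalar

# the longest list in each row decides how many elements every column needs
lengths = split.apply(lambda s: s.str.len()).max(axis=1).fillna(1).astype(int)

# pad NaN cells to a list of NaNs of the matching length
for col in cols:
    split[col] = [v if isinstance(v, list) else [np.nan] * n
                  for n, v in zip(lengths, split[col])]

df_long = split.explode(cols).reset_index(drop=True)
print(df_long)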
Given data as:
| | a | b | c |
|---:|----:|----:|----:|
| 0 | nan | nan | 1 |
| 1 | nan | 2 | nan |
| 2 | 3 | 3 | 3 |
I would like to create a column d containing [1, 2, 3].
There can be an arbitrary number of columns (though it will be < 30).
Using
df.isna().apply(lambda x: x.idxmin(), axis=1)
will give me:
0 c
1 b
2 a
dtype: object
Which seems useful, but I'm drawing a blank on how to access the columns with this, or whether there's a more suitable approach.
Repro:
import io
import pandas as pd
df = pd.read_csv(io.StringIO(',a,b,c\n0,,,1\n1,,2,\n2,3,3,3\n'), index_col=0)
Try this:
df.fillna(method='bfill', axis=1).iloc[:, 0]
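Note that fillna(method='bfill') is deprecated in recent pandas; the same idea with the modern spelling, assigned to d (assuming the repro above), would be:

df['d'] = df.bfill(axis=1).iloc[:, 0]  # backfill across columns, then take the first column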
What if you use min on axis=1?
df['min_val'] = df.min(axis=1)
a b c min_val
0 NaN NaN 1.0 1.0
1 NaN 2.0 NaN 2.0
2 3.0 3.0 3.0 3.0
And to get the respective columns:
df['min_val_col'] = df.idxmin(axis=1)
a b c min_val_col
0 NaN NaN 1.0 c
1 NaN 2.0 NaN b
2 3.0 3.0 3.0 a
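To go from those column labels to the values themselves, here is a sketch (not from the answers; it assumes every row has at least one non-NaN value and uses the repro frame from the question):

import io
import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO(',a,b,c\n0,,,1\n1,,2,\n2,3,3,3\n'), index_col=0)

first_col = df.isna().idxmin(axis=1)  # label of the first non-NaN column per row
# pick that column's value in each row by positional lookup
df['d'] = df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(first_col)]
print(df)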
I have the following pandas dataframe
| A | B |
| :-|:------:|
| 1 | [2,3,4]|
| 2 | np.nan |
| 3 | np.nan |
| 4 | 10 |
I would like to unlist the first row and place those values sequentially in the subsequent rows. The outcome will look like this:
| A | B |
| :-|:------:|
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 10 |
How can I achieve this in a very large dataset where this phenomenon occurs in many rows?
If the NaN values serve as "slack" space that the list elements can slot into (i.e. the lengths match), then you can explode column "B", drop the NaN values with dropna, reset the index and assign back to "B":
df['B'] = df['B'].explode().dropna().reset_index(drop=True)
Output:
A B
0 1 2
1 2 3
2 3 4
3 4 10
If the number of consecutive NaNs does not match the length of the list, you can form groups that start at each non-NaN element and explode while keeping each group's length constant.
I used a slightly different example for clarity (I also assigned to a different column):
df['C'] = (df['B']
           .groupby(df['B'].notna().cumsum())
           .apply(lambda s: s.explode().iloc[:len(s)])
           .values
           )
Output:
A B C
0 1 [2, 3, 4] 2
1 2 NaN 3
2 3 NaN 4
3 4 NaN NaN
4 5 10 10
Used input:
df = pd.DataFrame({'A': range(1, 6),
                   'B': [[2, 3, 4], np.nan, np.nan, np.nan, 10]})
I have a concept of what I need to do, but I can't write code that runs correctly; please take a look and give some advice.
step 1. find the rows that contain values in the second column
step 2. for those rows, compare the value in the first column with that of the previous row
step 3. drop the row with the larger first-column value
|missing | diff |
|--------|------|
| 0 | nan |
| 1 | 60 |
| 1 | nan |
| 0 | nan |
| 0 | nan |
| 1 | 180 |
| 1 | nan |
| 0 | 120 |
e.g. I want to compare the missing values of the rows that have values in diff ([120, 180, 60]) with those of their previous rows. In the end, the desired dataframe will look like:
|missing | diff |
|--------|------|
| 0 | nan |
| 1 | nan |
| 0 | nan |
| 0 | nan |
| 0 | 120 |
Update (after trying the answer): I get the same df as the original df.
import pandas as pd
import numpy as np
data={'missing':[0,1,1,0,0,1,1,0],'diff':[np.nan,60,np.nan,np.nan,np.nan,180,np.nan,120]}
df=pd.DataFrame(data)
df
missing diff
0 0 NaN
1 1 60.0
2 1 NaN
3 0 NaN
4 0 NaN
5 1 180.0
6 1 NaN
7 0 120.0
for ind in df.index:
    if df['diff'][ind] != np.nan:
        if ind != 0:
            if df['missing'][ind] > df['missing'][ind-1]:
                df = df.drop(ind, axis=0)
            else:
                df = df.drop(ind-1, axis=0)
df
missing diff
0 0 NaN
1 1 60.0
2 1 NaN
3 0 NaN
4 0 NaN
5 1 180.0
6 1 NaN
7 0 120.0
IIUC, you can try:
m = df['diff'].notna()
df = (
    pd.concat([
        df[df['diff'].isna()],
        df[m][df[m.shift(-1).fillna(False)]['missing'].values >
              df[m]['missing'].values]
    ])
)
OUTPUT:
missing diff
1 0 <NA>
3 1 <NA>
4 0 <NA>
5 0 <NA>
7 1 <NA>
8 0 120
This will work for sure
for ind in df.index:
    if not np.isnan(df['diff'][ind]):
        if ind != 0:
            if df['missing'][ind] > df['missing'][ind-1]:
                df = df.drop(ind, axis=0)
            else:
                df = df.drop(ind-1, axis=0)
This will work
for ind in df.index:
    if pd.notna(df['diff'][ind]):  # comparing to the string "nan" would never match a real NaN
        if ind != 0:
            if df['missing'][ind] > df['missing'][ind-1]:
                df = df.drop(ind, axis=0)
            else:
                df = df.drop(ind-1, axis=0)
import numpy as np   # for np.nan
import pandas as pd  # import pandas

# define dictionary
data = {'missing': [0, 1, 1, 0, 0, 1, 1, 0],
        'diff': [np.nan, 60, np.nan, np.nan, np.nan, 180, np.nan, 120]}
# dictionary to dataframe
df = pd.DataFrame(data)
print(df)

# for each row in dataframe
for ind in df.index:
    # only process rows whose diff value is a number
    if pd.notna(df['diff'][ind]):
        # compare it with the previous row's value in the first column
        if ind != 0:
            if df['missing'][ind] > df['missing'][ind-1]:
                # drop the row with the larger first-column value
                df = df.drop(ind, axis=0)
            else:
                df = df.drop(ind-1, axis=0)
print(df)
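For reference, a vectorized sketch of the three steps (my own, not from the answers above), using the data from the update; it gives the desired output for this example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'missing': [0, 1, 1, 0, 0, 1, 1, 0],
                   'diff': [np.nan, 60, np.nan, np.nan, np.nan, 180, np.nan, 120]})

has_diff = df['diff'].notna()                                        # step 1: rows with a value in 'diff'
larger_here = has_diff & (df['missing'] > df['missing'].shift())     # step 2: this row's 'missing' is larger than the previous row's
larger_prev = (has_diff & ~larger_here).shift(-1, fill_value=False)  # step 3: otherwise the previous row is the one to drop

out = df[~(larger_here | larger_prev)].reset_index(drop=True)
print(out)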
Given the following pandas dataframe
+----+------------------+-------------------------------------+--------------------------------+
| | AgeAt_X | AgeAt_Y | AgeAt_Z |
|----+------------------+-------------------------------------+--------------------------------+
| 0 | Older than 100 | Older than 100 | 74.13 |
| 1 | nan | nan | 58.46 |
| 2 | nan | 8.4 | 54.15 |
| 3 | nan | nan | 57.04 |
| 4 | nan | 57.04 | nan |
+----+------------------+-------------------------------------+--------------------------------+
How can I replace values in specific columns which equal Older than 100 with nan, so that I get:
+----+------------------+-------------------------------------+--------------------------------+
| | AgeAt_X | AgeAt_Y | AgeAt_Z |
|----+------------------+-------------------------------------+--------------------------------+
| 0 | nan | nan | 74.13 |
| 1 | nan | nan | 58.46 |
| 2 | nan | 8.4 | 54.15 |
| 3 | nan | nan | 57.04 |
| 4 | nan | 57.04 | nan |
+----+------------------+-------------------------------------+--------------------------------+
Notes
After removing the Older than 100 string from the desired columns, I convert the columns to numeric in order to perform calculations on said columns.
There are other columns in this dataframe (that I have excluded from this example), which will not be converted to numeric, so the conversion to numeric must be done one column at a time.
What I've tried
Attempt 1
if df.isin('Older than 100'):
    df.loc[df['AgeAt_X']] = ''
else:
    df['AgeAt_X'] = pd.to_numeric(df["AgeAt_X"])
Attempt 2
if df.loc[df['AgeAt_X']] == 'Older than 100r':
    df.loc[df['AgeAt_X']] = ''
elif df.loc[df['AgeAt_X']] == '':
    df['AgeAt_X'] = pd.to_numeric(df["AgeAt_X"])
Attempt 3
df['AgeAt_X'] = ['' if ele == 'Older than 100' else df.loc[df['AgeAt_X']] for ele in df['AgeAt_X']]
Attempts 1, 2 and 3 return the following error:
KeyError: 'None of [0 NaN\n1 NaN\n2 NaN\n3 NaN\n4 NaN\n5 NaN\n6 NaN\n7 NaN\n8 NaN\n9 NaN\n10 NaN\n11 NaN\n12 NaN\n13 NaN\n14 NaN\n15 NaN\n16 NaN\n17 NaN\n18 NaN\n19 NaN\n20 NaN\n21 NaN\n22 NaN\n23 NaN\n24 NaN\n25 NaN\n26 NaN\n27 NaN\n28 NaN\n29 NaN\n ..\n6332 NaN\n6333 NaN\n6334 NaN\n6335 NaN\n6336 NaN\n6337 NaN\n6338 NaN\n6339 NaN\n6340 NaN\n6341 NaN\n6342 NaN\n6343 NaN\n6344 NaN\n6345 NaN\n6346 NaN\n6347 NaN\n6348 NaN\n6349 NaN\n6350 NaN\n6351 NaN\n6352 NaN\n6353 NaN\n6354 NaN\n6355 NaN\n6356 NaN\n6357 NaN\n6358 NaN\n6359 NaN\n6360 NaN\n6361 NaN\nName: AgeAt_X, Length: 6362, dtype: float64] are in the [index]'
Attempt 4
df['AgeAt_X'] = df['AgeAt_X'].replace({'Older than 100': ''})
Attempt 4 returns the following error:
TypeError: Cannot compare types 'ndarray(dtype=float64)' and 'str'
I've also looked at a few posts. The two below do not actually replace the value but create a new column derived from others:
Replace specific values in Pandas DataFrame
Pandas replace DataFrame values
We can loop through each column and check whether the sentence is present. If we get a hit, we replace the sentence with 'NaN' using Series.str.replace and right after convert the column to numeric with Series.astype, in this case float:
df.dtypes
AgeAt_X object
AgeAt_Y object
AgeAt_Z float64
dtype: object
sent = 'Older than 100'

for col in df.columns:
    if sent in df[col].values:
        df[col] = df[col].str.replace(sent, 'NaN')
        df[col] = df[col].astype(float)
print(df)
AgeAt_X AgeAt_Y AgeAt_Z
0 NaN NaN 74.13
1 NaN NaN 58.46
2 NaN 8.40 54.15
3 NaN NaN 57.04
4 NaN 57.04 NaN
df.dtypes
AgeAt_X float64
AgeAt_Y float64
AgeAt_Z float64
dtype: object
If I understand you correctly, you can replace all occurrences of Older than 100 with np.nan with a single call to DataFrame.replace. If all remaining values are numeric, then the replace will implicitly change the data type of the column to numeric:
# Minimal example DataFrame
df = pd.DataFrame({'AgeAt_X': ['Older than 100', np.nan, np.nan],
                   'AgeAt_Y': ['Older than 100', np.nan, 8.4],
                   'AgeAt_Z': [74.13, 58.46, 54.15]})
df
AgeAt_X AgeAt_Y AgeAt_Z
0 Older than 100 Older than 100 74.13
1 NaN NaN 58.46
2 NaN 8.4 54.15
df.dtypes
AgeAt_X object
AgeAt_Y object
AgeAt_Z float64
dtype: object
# Replace occurrences of 'Older than 100' with np.nan in any column
df.replace('Older than 100', np.nan, inplace=True)
df
AgeAt_X AgeAt_Y AgeAt_Z
0 NaN NaN 74.13
1 NaN NaN 58.46
2 NaN 8.4 54.15
df.dtypes
AgeAt_X float64
AgeAt_Y float64
AgeAt_Z float64
dtype: object
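As an aside (not from either answer): since the end goal is numeric columns, pd.to_numeric with errors='coerce' turns any non-numeric value, including Older than 100, into NaN and converts the column in one step, which also suits the one-column-at-a-time constraint from the question. The column list here is illustrative:

age_cols = ['AgeAt_X', 'AgeAt_Y', 'AgeAt_Z']  # illustrative list of the columns to convert

for col in age_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')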
I have a dataframe which looks like:
PRIO Art Name Value
1 A Alpha 0
1 A Alpha 0
1 A Beta 1
2 A Alpha 3
2 B Theta 2
How can I reshape the dataframe so that all unique names become columns, each with its corresponding value next to it (note that I want to ignore duplicate rows)?
So in this case:
PRIO Art Alpha Alpha_value Beta Beta_value Theta Theta_value
1 A 1 0 1 1 NaN NaN
2 A 1 3 NaN NaN NaN NaN
2 B NaN NaN NaN NaN 1 2
Here's one way using pivot_table. A few tricky things to keep in mind:
- You need to specify both 'PRIO' and 'Art' as the pivot index
- We can also use two aggregation funcs to get it done in a single call
- We have to rename the level 0 columns to distinguish them, so you need to swap levels and rename
out = df.pivot_table(index=['PRIO', 'Art'], columns='Name', values='Value',
                     aggfunc=[lambda x: 1, 'first'])
# get the column names right
d = {'<lambda>':'is_present', 'first':'value'}
out = out.rename(columns=d, level=0)
out.columns = out.swaplevel(1,0, axis=1).columns.map('_'.join)
print(out.reset_index())
PRIO Art Alpha_is_present Beta_is_present Theta_is_present Alpha_value \
0 1 A 1.0 1.0 NaN 0.0
1 2 A 1.0 NaN NaN 3.0
2 2 B NaN NaN 1.0 NaN
Beta_value Theta_value
0 1.0 NaN
1 NaN NaN
2 NaN 2.0
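If you want the is_present and value columns interleaved per name, as in the desired output, one small follow-up (not part of the original answer) is to sort the flattened column labels before resetting the index:

out = out[sorted(out.columns)]  # Alpha_is_present, Alpha_value, Beta_is_present, Beta_value, ...
print(out.reset_index())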
Groupby twice: first to pivot Name, suffixing the columns with _value; then groupby the same keys and count the unique values (nunique). Join the two; when joining, drop the duplicate columns and rename the others as appropriate.
g = (df.groupby(['Art', 'PRIO', 'Name'])['Value']
       .first().unstack().reset_index().add_suffix('_value'))

print(g.join(df.groupby(['PRIO', 'Art', 'Name'])['Value']
              .nunique().unstack('Name').reset_index())
       .drop(columns=['PRIO_value', 'Art'])
       .rename(columns={'Art_value': 'Art'}))
Name Art Alpha_value Beta_value Theta_value PRIO Alpha Beta Theta
0 A 0.0 1.0 NaN 1 1.0 1.0 NaN
1 A 3.0 NaN NaN 2 1.0 NaN NaN
2 B NaN NaN 2.0 2 NaN NaN 1.0
This is an example of pd.crosstab() and groupby().
df = pd.concat([pd.crosstab([df['PRIO'], df['Art']], df['Name']),
                df.groupby(['PRIO', 'Art', 'Name'])['Value'].sum().unstack().add_suffix('_value')],
               axis=1).reset_index()
df
| | Alpha | Beta | Theta | Alpha_value | Beta_value | Theta_value |
|:---------|--------:|-------:|--------:|--------------:|-------------:|--------------:|
| (1, 'A') | 1 | 1 | 0 | 0 | 1 | nan |
| (2, 'A') | 1 | 0 | 0 | 3 | nan | nan |
| (2, 'B') | 0 | 0 | 1 | nan | nan | 2 |