How to expand a df by different dicts as columns? - python

I have a df with different dicts as entries in a column, in my case the column "information". I would like to expand the df by all possible dict.keys(), something like this:
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': pd.Series([1, 2, 3, 4, 5]),
                   'name': pd.Series(['banana',
                                      'apple',
                                      'orange',
                                      'strawberry',
                                      'toast']),
                   'information': pd.Series([{'shape': 'curve', 'color': 'yellow'},
                                             {'color': 'red'},
                                             {'shape': 'round'},
                                             {'amount': 500},
                                             np.nan]),
                   'cost': pd.Series([1, 2, 2, 10, 4])})
   id        name                            information  cost
0   1      banana  {'shape': 'curve', 'color': 'yellow'}     1
1   2       apple                       {'color': 'red'}     2
2   3      orange                     {'shape': 'round'}     2
3   4  strawberry                        {'amount': 500}    10
4   5       toast                                    NaN     4
It should look like this:
   id        name  shape   color  amount  cost
0   1      banana  curve  yellow     NaN     1
1   2       apple    NaN     red     NaN     2
2   3      orange  round     NaN     NaN     2
3   4  strawberry    NaN     NaN   500.0    10
4   5       toast    NaN     NaN     NaN     4

Another approach would be using pandas.DataFrame.from_records:
import pandas as pd

# pop removes 'information' from df; NaN entries become empty dicts so
# from_records can expand each dict into its own set of columns.
new = pd.DataFrame.from_records(df.pop('information').apply(lambda x: {} if pd.isna(x) else x))
new = pd.concat([df, new], axis=1)
print(new)
Output:
   cost  id        name  amount   color  shape
0     1   1      banana     NaN  yellow  curve
1     2   2       apple     NaN     red    NaN
2     2   3      orange     NaN     NaN  round
3    10   4  strawberry   500.0     NaN    NaN
4     4   5       toast     NaN     NaN    NaN

You can use:
# v != v is True only for NaN, so missing entries become empty dicts
d = {k: {} if v != v else v for k, v in df.pop('information').items()}
df1 = pd.DataFrame.from_dict(d, orient='index')
df = pd.concat([df, df1], axis=1)
print(df)
   id        name  cost  shape   color  amount
0   1      banana     1  curve  yellow     NaN
1   2       apple     2    NaN     red     NaN
2   3      orange     2  round     NaN     NaN
3   4  strawberry    10    NaN     NaN   500.0
4   5       toast     4    NaN     NaN     NaN
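
For completeness, pd.json_normalize (top-level since pandas 1.0) gets the same result; a minimal sketch, reusing a fresh df as defined in the question and the same empty-dict trick for the NaN row:
import pandas as pd

# Swap NaN for empty dicts (json_normalize expects dict records), expand,
# then glue the new columns back on; both frames share the default RangeIndex.
info = pd.json_normalize(df.pop('information').apply(lambda x: {} if pd.isna(x) else x))
out = pd.concat([df, info], axis=1)
print(out)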

Search N consecutive rows with same value in one dataframe

I need to write Python code that finds N (a variable) consecutive rows in a DataFrame column with the same value, excluding NaN, like this.
I can't figure out how to do it with a for loop because I don't know which row I'm looking at in each case. Any ideas on how I can do it?
Fruit   2 matches  5 matches
Apple   No         No
NaN     No         No
Pear    No         No
Pear    Yes        No
Pear    Yes        No
Pear    Yes        No
Pear    Yes        Yes
NaN     No         No
NaN     No         No
NaN     No         No
NaN     No         No
NaN     No         No
Banana  No         No
Banana  Yes        No
Update: testing the solutions by @Corralien
counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum())  # virtual groups
            .transform('cumcount').add(1)                           # cumulative counter
            .where(df['Fruit'].notna(), other=0))                   # set NaN to 0
N = 2
df['Matches'] = df.where(counts >= N, other='No')
VSCode returns the 'Frame skipped from debugging during step-in.' message when executing the last line, and it raises an exception in the previous for loop.
Compute consecutive values and set NaN to 0. Once you have calculated the cumulative counter, you just have to check if the counter is greater than or equal to N:
counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum())  # virtual groups
            .transform('cumcount').add(1)                           # cumulative counter
            .where(df['Fruit'].notna(), other=0))                   # set NaN to 0
N = 2
df['2 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})
N = 5
df['5 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})
Output:
>>> df
     Fruit 2 matches 5 matches
0    Apple        No        No
1      NaN        No        No
2     Pear        No        No
3     Pear       Yes        No
4     Pear       Yes        No
5     Pear       Yes        No
6     Pear       Yes       Yes
7      NaN        No        No
8      NaN        No        No
9      NaN        No        No
10     NaN        No        No
11     NaN        No        No
12  Banana        No        No
13  Banana       Yes        No
>>> counts
0     1
1     0
2     1
3     2
4     3
5     4
6     5
7     0
8     0
9     0
10    0
11    0
12    1
13    2
dtype: int64
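The key step is df['Fruit'].ne(df['Fruit'].shift()).cumsum(): it flags each row where the value differs from the previous one and cumulatively sums the flags, so every run of equal values gets its own group id. A tiny sketch of just that idiom:
import pandas as pd
import numpy as np

s = pd.Series(['Apple', np.nan, 'Pear', 'Pear', 'Pear'])
# ne() treats NaN as different from everything, so NaN breaks a run;
# cumsum turns the change flags into group ids: 1, 2, 3, 3, 3.
print(s.ne(s.shift()).cumsum().tolist())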
Update: if you need to replace "Yes" with the fruit name, for example:
N = 2
df['2 matches'] = df.where(counts >= N, other='No')
print(df)
# Output
     Fruit 2 matches
0    Apple        No
1      NaN        No
2     Pear        No
3     Pear      Pear
4     Pear      Pear
5     Pear      Pear
6     Pear      Pear
7      NaN        No
8      NaN        No
9      NaN        No
10     NaN        No
11     NaN        No
12  Banana        No
13  Banana    Banana

Pandas merging/joining tables with multiple key columns and duplicating rows where necessary

I have several tables that contain lab results, with a 'master' table of sample data containing things like a description. The results tables are also broken down by specimen (sub-samples). They contain multiple result columns; I'm just showing one here. I want to combine all the results tables into one dataframe, like this:
Table 1:
Location  Sample  Description
1         A       Yellow
1         B       Red
2         A       Blue
2         B       Violet

Table 2:
Location  Sample  Specimen  Result1
1         A       X         5
1         A       Y         6
1         B       X         10
2         A       X         1

Table 3:
Location  Sample  Specimen  Result2
1         A       X         "Heavy"
1         A       Q         "Soft"
1         B       K         "Grey"
2         B       Z         "Bananas"

Desired output:
Location  Sample  Description  Specimen  Result1  Result2
1         A       Yellow       X         5        "Heavy"
1         A       Yellow       Y         6        nan
1         A       Yellow       Q         nan      "Soft"
1         B       Red          X         10       nan
1         B       Red          K         nan      "Grey"
2         A       Blue         X         1        nan
2         B       Violet       Z         nan      "Bananas"
I currently have a solution for this using iterrows() and df.append(), but these are both slow operations, and when there are thousands of results it takes too long. Is there a better way? I have tried join() and merge(), but I can't seem to get the result I want.
Quick code to reproduce my dataframes:
import pandas as pd

dict1 = {'Location': [1, 1, 2, 2], 'Sample': ['A', 'B', 'A', 'B'],
         'Description': ['Yellow', 'Red', 'Blue', 'Violet']}
dict2 = {'Location': [1, 1, 1, 2], 'Sample': ['A', 'A', 'B', 'A'],
         'Specimen': ['x', 'y', 'x', 'x'], 'Result1': [5, 6, 10, 1]}
dict3 = {'Location': [1, 1, 1, 2], 'Sample': ['A', 'A', 'B', 'B'],
         'Specimen': ['x', 'q', 'k', 'z'],
         'Result2': ["Heavy", "Soft", "Grey", "Bananas"]}
df1 = pd.DataFrame.from_dict(dict1)
df2 = pd.DataFrame.from_dict(dict2)
df3 = pd.DataFrame.from_dict(dict3)
The first idea is to join df2 and df3 together with concat, aggregate with sum over the unique 'Location', 'Sample', 'Specimen' rows, and finally merge to df1 (the output below was produced against an earlier, numeric version of Result2; see the EDIT below for the string data):
df23 = (pd.concat([df2, df3])
          .groupby(['Location', 'Sample', 'Specimen'], as_index=False, sort=False)
          .sum(min_count=1))
df = df1.merge(df23, on=['Location', 'Sample'])
print(df)
   Location Sample Description Specimen  Result1  Result2
0         1      A      Yellow        x      5.0      4.0
1         1      A      Yellow        y      6.0      NaN
2         1      A      Yellow        q      NaN      6.0
3         1      B         Red        x     10.0      NaN
4         1      B         Red        k      NaN      8.0
5         2      A        Blue        x      1.0      NaN
6         2      B      Violet        z      NaN      5.0
Or, if all rows in df2 and df3 are unique per the columns ['Location','Sample','Specimen'], the solution is simpler:
df23 = pd.concat([df2.set_index(['Location', 'Sample', 'Specimen']),
                  df3.set_index(['Location', 'Sample', 'Specimen'])], axis=1)
df = df1.merge(df23.reset_index(), on=['Location', 'Sample'])
print(df)
   Location Sample Description Specimen  Result1  Result2
0         1      A      Yellow        q      NaN      6.0
1         1      A      Yellow        x      5.0      4.0
2         1      A      Yellow        y      6.0      NaN
3         1      B         Red        k      NaN      8.0
4         1      B         Red        x     10.0      NaN
5         2      A        Blue        x      1.0      NaN
6         2      B      Violet        z      NaN      5.0
EDIT: With the new data, the second solution works well:
df23 = pd.concat([df2.set_index(['Location', 'Sample', 'Specimen']),
                  df3.set_index(['Location', 'Sample', 'Specimen'])], axis=1)
df = df1.merge(df23.reset_index(), on=['Location', 'Sample'])
print(df)
   Location Sample Description Specimen  Result1  Result2
0         1      A      Yellow        q      NaN     Soft
1         1      A      Yellow        x      5.0    Heavy
2         1      A      Yellow        y      6.0      NaN
3         1      B         Red        k      NaN     Grey
4         1      B         Red        x     10.0      NaN
5         2      A        Blue        x      1.0      NaN
6         2      B      Violet        z      NaN  Bananas
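
For the record: since the two results tables share all three key columns, a single outer merge also gives the desired rows (possibly in a different order), without the intermediate index; a minimal sketch using df1, df2 and df3 from above:
# Outer-join the results tables on the shared keys so specimens present
# in only one table survive with NaNs, then attach the master table.
df23 = df2.merge(df3, on=['Location', 'Sample', 'Specimen'], how='outer')
out = df1.merge(df23, on=['Location', 'Sample'])
print(out)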

Drop rows from a slice of Multi-Index DataFrame based on boolean

EDIT: Upon request I provide an example that is closer to the real data I am working with.
So I have a table data that looks something like
            value0    value1    value2
run step
0   0      0.12573 -0.132105  0.640423
    1       0.1049 -0.535669  0.361595
    2        1.304  0.947081 -0.703735
    3    -1.265421 -0.623274  0.041326
    4    -2.325031 -0.218792 -1.245911
    5    -0.732267 -0.544259   -0.3163
1   0     0.411631  1.042513 -0.128535
    1     1.366463 -0.665195   0.35151
    2      0.90347  0.094012 -0.743499
    3    -0.921725 -0.457726  0.220195
    4    -1.009618 -0.209176 -0.159225
    5     0.540846  0.214659  0.355373
(think: collection of time series) and a second table valid_range
     start  stop
run
0        1     3
1        2     5
For each run I want to drop all rows that do not satisfy start≤step≤stop.
I tried the following (table generating code at the end)
for idx in valid_range.index:
    slc = data.loc[idx]
    start, stop = valid_range.loc[idx]
    cond = (start <= slc.index) & (slc.index <= stop)
    data.loc[idx] = data.loc[idx][cond]
However, this results in:
         value0 value1 value2
run step
0   0       NaN    NaN    NaN
    1       NaN    NaN    NaN
    2       NaN    NaN    NaN
    3       NaN    NaN    NaN
    4       NaN    NaN    NaN
    5       NaN    NaN    NaN
1   0       NaN    NaN    NaN
    1       NaN    NaN    NaN
    2       NaN    NaN    NaN
    3       NaN    NaN    NaN
    4       NaN    NaN    NaN
    5       NaN    NaN    NaN
I also tried data.loc[idx].drop(slc[cond].index, inplace=True) but it didn't have any effect...
Generating code for the tables:
import numpy as np
from pandas import DataFrame, MultiIndex, Index

rng = np.random.default_rng(0)
valid_range = DataFrame({"start": [1, 2], "stop": [3, 5]},
                        index=Index(range(2), name="run"))
midx = MultiIndex(levels=[[], []], codes=[[], []], names=["run", "step"])
data = DataFrame(columns=[f"value{k}" for k in range(3)], index=midx)
for run in range(2):
    for step in range(6):
        data.loc[(run, step), :] = rng.normal(size=3)
First, merge data and valid_range on 'run', using the merge method:
>>> data
            value0     value1    value2
run step
0   0      0.12573  -0.132105  0.640423
    1       0.1049  -0.535669  0.361595
    2        1.304   0.947081 -0.703735
    3     -1.26542  -0.623274  0.041326
    4     -2.32503  -0.218792  -1.24591
    5    -0.732267  -0.544259   -0.3163
1   0     0.411631    1.04251 -0.128535
    1      1.36646  -0.665195   0.35151
    2      0.90347  0.0940123 -0.743499
    3    -0.921725  -0.457726  0.220195
    4     -1.00962  -0.209176 -0.159225
    5     0.540846   0.214659  0.355373
>>> valid_range
     start  stop
run
0        1     3
1        2     5
>>> merged = data.reset_index().merge(valid_range, how='left', on='run')
>>> merged
    run  step     value0     value1     value2  start  stop
0     0     0    0.12573  -0.132105   0.640423      1     3
1     0     1     0.1049  -0.535669   0.361595      1     3
2     0     2      1.304   0.947081  -0.703735      1     3
3     0     3   -1.26542  -0.623274   0.041326      1     3
4     0     4   -2.32503  -0.218792   -1.24591      1     3
5     0     5  -0.732267  -0.544259    -0.3163      1     3
6     1     0   0.411631    1.04251  -0.128535      2     5
7     1     1    1.36646  -0.665195    0.35151      2     5
8     1     2    0.90347  0.0940123  -0.743499      2     5
9     1     3  -0.921725  -0.457726   0.220195      2     5
10    1     4   -1.00962  -0.209176  -0.159225      2     5
11    1     5   0.540846   0.214659   0.355373      2     5
Then select the rows which satisfy the condition using eval (chained comparisons are supported in eval/query; note the inclusive bounds, matching the required start <= step <= stop). Use the boolean array to mask data:
>>> cond = merged.eval('start <= step <= stop').to_numpy()
>>> data[cond]
            value0     value1    value2
run step
0   1       0.1049  -0.535669  0.361595
    2        1.304   0.947081 -0.703735
    3     -1.26542  -0.623274  0.041326
1   2      0.90347  0.0940123 -0.743499
    3    -0.921725  -0.457726  0.220195
    4     -1.00962  -0.209176 -0.159225
    5     0.540846   0.214659  0.355373
Or, if you want, here is a similar approach using query:
res = (
    data.reset_index()
        .merge(valid_range, on='run', how='left')
        .query('start <= step <= stop')
        .drop(columns=['start', 'stop'])
        .set_index(['run', 'step'])
)
I would go with groupby, like this (illustrated on the lama/cow/falcon example from the pandas docs, keeping rows where 'small' > 1):
(df.groupby(level=0)
   .apply(lambda x: x[x['small'] > 1])
   .reset_index(level=0, drop=True)  # remove duplicate index
)
which gives:
                          big  small
animal animal attribute
cow    cow    speed      30.0   20.0
              weight    250.0  150.0
falcon falcon speed     320.0  250.0
lama   lama   speed      45.0   30.0
              weight    200.0  100.0
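
Back to the question's frames: a merge-free alternative is to broadcast each run's bounds onto the rows through the index levels and mask in one shot. A sketch, using only data and valid_range from above:
# reindex repeats each run's start/stop once per (run, step) row;
# a single boolean mask then keeps rows with start <= step <= stop.
runs = data.index.get_level_values('run')
steps = data.index.get_level_values('step')
start = valid_range['start'].reindex(runs).to_numpy()
stop = valid_range['stop'].reindex(runs).to_numpy()
print(data[(start <= steps) & (steps <= stop)])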

Pandas replace all non-NaN entries of a dataframe with 1 leave NaN alone

How can I replace all the non-NaN values in a pandas dataframe with 1, but leave the NaN values alone? This almost does what I'm looking for. The problem is that it also turns NaN values into 0, which I then have to reset to NaN afterwards.
I would like this
     a    b
0  NaN  QQQ
1  AAA  NaN
2  NaN  BBB
to become this
     a    b
0  NaN    1
1    1  NaN
2  NaN    1
This code is almost what I want
newdf = df.notnull().astype('int')
The above code does this
   a  b
0  0  1
1  1  0
2  0  1
One way would be to select all non-null values from the original data frame and set them to one:
df[df.notnull()] = 1
This solution on your data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 'AAA', np.nan], 'b': ['QQQ', np.nan, 'BBB']})
df[df.notnull()] = 1
df
     a    b
0  NaN    1
1    1  NaN
2  NaN    1
You can use np.where() with DataFrame.isna() to accomplish this:
df = pd.DataFrame(data=[[1, np.nan, 5],
                        ['q', np.nan, np.nan],
                        ['7', {'a': 1}, np.nan]],
                  columns=['a', 'b', 'c'])

   a         b    c
0  1       NaN  5.0
1  q       NaN  NaN
2  7  {'a': 1}  NaN

df1 = pd.DataFrame(np.where(df.isna(), df, 1), columns=df.columns)

   a    b    c
0  1  NaN    1
1  1  NaN  NaN
2  1    1  NaN
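
DataFrame.mask is another option: it writes a value wherever the condition holds and leaves everything else (here, the NaNs) untouched. A minimal, self-contained sketch on the frame from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 'AAA', np.nan], 'b': ['QQQ', np.nan, 'BBB']})
# Wherever df is non-null, write 1; NaN cells fail notna() and pass
# through unchanged, so nothing needs resetting afterwards.
print(df.mask(df.notna(), 1))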

Pandas merge rows with ids in separate columns

Total meltdown here, need some assistance.
I have a DataFrame with 10m+ rows and some 150 columns, with two ids, looking like below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id1': [1, 2, 5, 3, 6, 4],
                   'id2': [2, 1, np.nan, 4, np.nan, 3],
                   'num': [123, 3231, 123, 231, 6534, 2394]})

   id1  id2   num
0    1  2.0   123
1    2  1.0  3231
2    5  NaN   123
3    3  4.0   231
4    6  NaN  6534
5    4  3.0  2394
Row indices 0 and 1 form a pair given id1 and id2, and row indices 3 and 5 form a pair in the same way. I want the table below, where the second row of each pair is merged into the first:
df = pd.DataFrame({'id1': [1, 5, 3, 6],
                   'id2': [2, np.nan, 4, np.nan],
                   'num': [123, 123, 231, 6534],
                   '2_num': [3231, np.nan, 2394, np.nan]})

   id1  id2   num   2_num
0    1  2.0   123  3231.0
1    5  NaN   123     NaN
2    3  4.0   231  2394.0
3    6  NaN  6534     NaN
How can this be achieved using id1 and id2, labeling all columns coming from the second row of a pair with "2_"?
Here's a merge-based approach (thanks to @piRSquared for the improvement):
ndf = (df.merge(df, 'left', left_on=['id1', 'id2'], right_on=['id2', 'id1'],
                suffixes=['', '_2'])
         .drop(['id1_2', 'id2_2'], axis=1))
cols = ['id1', 'id2']
ndf[cols] = np.sort(ndf[cols], axis=1)
new = ndf.drop_duplicates(subset=['id1', 'id2'], keep='first')
   id1  id2   num   num_2
0  1.0  2.0   123  3231.0
2  5.0  NaN   123     NaN
3  3.0  4.0   231  2394.0
4  6.0  NaN  6534     NaN
The idea is to sort each pair of ids so that we can group by them.
cols = ['id1', 'id2']
df[cols] = np.sort(df[cols], axis=1)

df.set_index(
    cols + [df.fillna(-1).groupby(cols).cumcount() + 1]
).num.unstack().add_suffix('_num').reset_index()
   id1  id2   1_num   2_num
0  1.0  2.0   123.0  3231.0
1  3.0  4.0   231.0  2394.0
2  5.0  NaN   123.0     NaN
3  6.0  NaN  6534.0     NaN
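
A side note on the fillna(-1) above: it is only there because groupby drops NaN keys by default, which would break the counter for the unpaired rows. On pandas 1.1+ you can keep those groups explicitly instead; a sketch:
# dropna=False (pandas >= 1.1) keeps groups whose keys contain NaN,
# so no sentinel fill value is needed for the cumulative counter.
counter = df.groupby(cols, dropna=False).cumcount() + 1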
Use:
df[['id1', 'id2']] = pd.DataFrame(np.sort(df[['id1', 'id2']].values, axis=1)).fillna('tmp')
print(df)
   id1  id2   num
0  1.0    2   123
1  1.0    2  3231
2  5.0  tmp   123
3  3.0    4   231
4  6.0  tmp  6534
5  3.0    4  2394
df1 = df.groupby(['id1', 'id2'])['num'].apply(list)
print(df1)
id1  id2
1.0  2.0    [123, 3231]
3.0  4.0    [231, 2394]
5.0  tmp          [123]
6.0  tmp         [6534]
Name: num, dtype: object
df2 = (pd.DataFrame(df1.values.tolist(),
                    index=df1.index,
                    columns=['num', '2_num'])
         .reset_index()
         .replace('tmp', np.nan))
print(df2)
   id1  id2   num   2_num
0  1.0  2.0   123  3231.0
1  3.0  4.0   231  2394.0
2  5.0  NaN   123     NaN
3  6.0  NaN  6534     NaN
