I am using Python and pandas for data analysis. My data is sparsely distributed across different columns, like the following:
| id | col1a | col1b | col2a | col2b | col3a | col3b |
|----|-------|-------|-------|-------|-------|-------|
| 1 | 11 | 12 | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | 21 | 86 | NaN | NaN |
| 3 | 22 | 87 | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | 545 | 32 |
I want to combine this sparsely distributed data into tightly packed columns, like the following:
| id | group | cola | colb |
|----|-------|-------|-------|
| 1 | g1 | 11 | 12 |
| 2 | g2 | 21 | 86 |
| 3 | g1 | 22 | 87 |
| 4 | g3 | 545 | 32 |
Here is what I have tried, but I am not able to do it properly:
df['cola']=np.nan
df['colb']=np.nan
df['cola'].fillna(df.col1a,inplace=True)
df['colb'].fillna(df.col1b,inplace=True)
df['cola'].fillna(df.col2a,inplace=True)
df['colb'].fillna(df.col2b,inplace=True)
df['cola'].fillna(df.col3a,inplace=True)
df['colb'].fillna(df.col3b,inplace=True)
But I think there must be a more concise and efficient way of doing this. How can I do this in a better way?
You can use df.stack(), assuming 'id' is your index (otherwise set 'id' as the index first). Then use pd.pivot_table.
df = df.stack().reset_index(name='val',level=1)
df['group'] = 'g' + df['level_1'].str.extract(r'col(\d+)', expand=False)
df['level_1'] = df['level_1'].str.replace(r'\d+', '', regex=True)
df.pivot_table(index=['id','group'],columns='level_1',values='val')
level_1 cola colb
id group
1 g1 11.0 12.0
2 g2 21.0 86.0
3 g1 22.0 87.0
4 g3 545.0 32.0
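The same reshaping can also be sketched with melt instead of stack. This is only a minimal, self-contained variant: the frame is rebuilt from the question's table, and the names long, parts, num and suffix are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Frame rebuilt from the table in the question
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'col1a': [11, np.nan, 22, np.nan],
                   'col1b': [12, np.nan, 87, np.nan],
                   'col2a': [np.nan, 21, np.nan, np.nan],
                   'col2b': [np.nan, 86, np.nan, np.nan],
                   'col3a': [np.nan, np.nan, np.nan, 545],
                   'col3b': [np.nan, np.nan, np.nan, 32]})

# Long format: one row per (id, original column name), keeping only filled cells
long = df.melt(id_vars='id', var_name='name', value_name='val').dropna(subset=['val'])

# Split e.g. 'col2b' into the group number '2' and the suffix 'b'
parts = long['name'].str.extract(r'col(?P<num>\d+)(?P<suffix>\w+)')
long['group'] = 'g' + parts['num']
long['name'] = 'col' + parts['suffix']

# Back to wide format with one column per suffix
out = (long.pivot(index=['id', 'group'], columns='name', values='val')
           .rename_axis(columns=None)
           .reset_index())
print(out)
```

Because the digits are extracted from the column names, the group label follows the data (g1, g2, g1, g3) rather than the row id.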
Another alternative with pd.wide_to_long:
m = pd.wide_to_long(df,['col'],'id','j',suffix=r'\d+\w+').reset_index()
(m.join(pd.DataFrame(m.pop('j').agg(list).tolist()))
.assign(group=lambda x:x[0].radd('g'))
.set_index(['id','group',1])['col'].unstack().dropna()
.rename_axis(None,axis=1).add_prefix('col').reset_index())
id group cola colb
0 1 g1 11 12
1 2 g2 21 86
2 3 g1 22 87
3 4 g3 545 32
Use:
import re
def fx(s):
    s = s.dropna()
    group = 'g' + re.search(r'\d+', s.index[0])[0]
    return pd.Series([group] + s.tolist(), index=['group', 'cola', 'colb'])
df1 = df.set_index('id').agg(fx, axis=1).reset_index()
# print(df1)
id group cola colb
0 1 g1 11.0 12.0
1 2 g2 21.0 86.0
2 3 g1 22.0 87.0
3 4 g3 545.0 32.0
This would be one way of doing it:
df = pd.DataFrame({'id':[1,2,3,4],
'col1a':[11,np.nan,22,np.nan],
'col1b':[12,np.nan,87,np.nan],
'col2a':[np.nan,21,np.nan,np.nan],
'col2b':[np.nan,86,np.nan,np.nan],
'col3a':[np.nan,np.nan,np.nan,545],
'col3b':[np.nan,np.nan,np.nan,32]})
df_new = df.copy(deep=False)
df_new['group'] = 'g'+df_new['id'].astype(str)
df_new['cola'] = df_new[[x for x in df_new.columns if x.endswith('a')]].sum(axis=1)
df_new['colb'] = df_new[[x for x in df_new.columns if x.endswith('b')]].sum(axis=1)
df_new = df_new[['id','group','cola','colb']]
print(df_new)
Output (group is built from id here, so the labels differ from the g1/g2/g1/g3 in the question):
id group cola colb
0 1 g1 11.0 12.0
1 2 g2 21.0 86.0
2 3 g3 22.0 87.0
3 4 g4 545.0 32.0
If you have more suffixes (colc, cold, cole, colf, etc.), you can use a loop:
suffixes = ['a','b','c','d','e','f']
cols = ['id','group'] + ['col'+x for x in suffixes]
for i in suffixes:
    # require the 'col' prefix so that 'id' is not picked up by the 'd' suffix
    df_new['col'+i] = df_new[[x for x in df_new.columns if x.startswith('col') and x.endswith(i)]].sum(axis=1)
df_new = df_new[cols]
Thanks to @CeliusStingher for providing the code for the dataframe.
One suggestion is to set id as the index, then rebuild the columns as a MultiIndex using the numbers extracted from the column names, and stack to get the final result:
#set id as index
df = df.set_index("id")
#pull out the numbers from each column
#so that you have (cola,1), (colb,1) ...
#add g to the numbers ... (cola, g1),(colb,g1), ...
#create a MultiIndex
#and reassign to the columns
df.columns = pd.MultiIndex.from_tuples([("".join((first,last)), f"g{second}")
                                        for first, second, last
                                        in df.columns.str.split(r"(\d)")],
                                       names=[None,"group"])
#stack the data
#to get your result
df.stack()
cola colb
id group
1 g1 11.0 12.0
2 g2 21.0 86.0
3 g1 22.0 87.0
4 g3 545.0 32.0
Let's say I have a dataframe like the one below:
+------+------+------+-------------+
| A | B | C | devisor_col |
+------+------+------+-------------+
| 2 | 4 | 10 | 2 |
| 3 | 3 | 9 | 3 |
| 10 | 25 | 40 | 10 |
+------+------+------+-------------+
What would be the best command to apply a formula using values from devisor_col? Note that I have thousands of columns and rows.
The result should look like this:
+------+------+------+-------------+
| A | B | C | devisor_col |
+------+------+------+-------------+
| 1 | 2 | 5 | 2 |
| 1 | 1 | 3 | 3 |
| 1 | 2.5 | 4 | 10 |
+------+------+------+-------------+
I tried using applymap, but I don't know why I can't apply it to all columns.
modResult = my_df.applymap(lambda x: x/x["devisor_col"])
IIUC, use pandas.DataFrame.divide with axis=0:
modResult= (
pd.concat(
[my_df, my_df.filter(like="Col") # selecting columns
.divide(my_df["devisor_col"], axis=0).add_suffix("_div")], axis=1)
)
# Output :
print(modResult)
Col1 Col2 Col3 devisor_col Col1_div Col2_div Col3_div
0 2 4 10 2 1.0 2.0 5.0
1 3 3 9 3 1.0 1.0 3.0
2 10 25 40 10 1.0 2.5 4.0
If you need only the result of the divide, use this :
modResult= my_df.filter(like="Col").divide(my_df["devisor_col"], axis=0)
print(modResult)
Col1 Col2 Col3
0 1.0 2.0 5.0
1 1.0 1.0 3.0
2 1.0 2.5 4.0
Or if you want to overwrite the old columns, use pandas.DataFrame.join:
modResult= (
my_df.filter(like="Col")
.divide(my_df["devisor_col"], axis=0)
.join(my_df["devisor_col"])
)
Col1 Col2 Col3 devisor_col
0 1.0 2.0 5.0 2
1 1.0 1.0 3.0 3
2 1.0 2.5 4.0 10
You can replace my_df.filter(like="Col") with my_df.loc[:, my_df.columns!="devisor_col"].
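For the frame exactly as shown in the question (columns A, B, C rather than Col1, Col2, Col3), the boolean-mask selection mentioned above may be the simpler fit. applymap hands each cell to the function as a scalar, so x["devisor_col"] has nothing to index, whereas divide with axis=0 aligns the divisor row by row. A minimal sketch, with value_cols as a name introduced here:

```python
import pandas as pd

# Frame rebuilt from the question's first table
my_df = pd.DataFrame({"A": [2, 3, 10],
                      "B": [4, 3, 25],
                      "C": [10, 9, 40],
                      "devisor_col": [2, 3, 10]})

# Boolean mask that selects every column except the divisor
value_cols = my_df.columns != "devisor_col"

# Divide each row by its own divisor, then put the divisor column back
modResult = (my_df.loc[:, value_cols]
                  .divide(my_df["devisor_col"], axis=0)
                  .join(my_df["devisor_col"]))
print(modResult)
```

The mask does not depend on a shared column-name prefix, so it should carry over to frames with thousands of arbitrarily named columns.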
You can try using .loc
df = pd.DataFrame([[1,2,3,1],[2,3,4,5],[4,5,6,7]], columns=['col1', 'col2', 'col3', 'divisor'])
df.loc[:, df.columns != 'divisor'] = df.loc[:, df.columns != 'divisor'].divide(df['divisor'], axis=0)
I'm currently trying to parse Excel files that contain somewhat structured information. The data I am interested in is in a subrange of an Excel sheet. Basically, the sheet contains key-value pairs where the key is usually named in a predictable manner (found with regex). The keys are all in the same column, and each value is to the right of its key in the sheet.
The regex pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment' predictably matches the keys. Therefore, if I can find the column where the keys are located and the rows where the keys are present, I can find the subrange of interest and parse it further.
Goals:
Get list of row indices that match regex (e.g. [5, 6, 8, 9])
Find which column contains keys that match regex (e.g. Unnamed: 3)
When I read in the Excel file using df_original = pd.read_excel(filename, sheet_name=sheet), the dataframe looks like this:
df_original = pd.DataFrame({'Unnamed: 0':['Value', 'Name', np.nan, 'Mark', 'Molly', 'Jack', 'Tom', 'Lena', np.nan, np.nan],
'Unnamed: 1':['High', 'New York', np.nan, '5000', '5250', '4600', '2500', '4950', np.nan, np.nan],
'Unnamed: 2':[np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'Unnamed: 3':['Other', 125, 127, np.nan, np.nan, 'Temperature (C)', 'Strength', np.nan, 'Temperature (F)', 'Comment'],
'Unnamed: 4':['Other 2', 25, 14.125, np.nan, np.nan, np.nan, '1500', np.nan, np.nan, np.nan],
'Unnamed: 5':[np.nan, np.nan, np.nan, np.nan, np.nan, 25, np.nan, np.nan, 77, 'Looks OK'],
'Unnamed: 6':[np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'Add water'],
})
+----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------+
| | Unnamed: 0 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------|
| 0 | Value | High | nan | Other | Other 2 | nan | nan |
| 1 | Name | New York | nan | 125 | 25 | nan | nan |
| 2 | nan | nan | nan | 127 | 14.125 | nan | nan |
| 3 | Mark | 5000 | nan | nan | nan | nan | nan |
| 4 | Molly | 5250 | nan | nan | nan | nan | nan |
| 5 | Jack | 4600 | nan | Temperature (C) | nan | 25 | nan |
| 6 | Tom | 2500 | nan | Strength | 1500 | nan | nan |
| 7 | Lena | 4950 | nan | nan | nan | nan | nan |
| 8 | nan | nan | nan | Temperature (F) | nan | 77 | nan |
| 9 | nan | nan | nan | Comment | nan | Looks OK | Add water |
+----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------+
This code finds the rows of interest and solves Goal 1.
df = df_original.dropna(how='all', axis=1)
pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'
mask = np.column_stack([df[col].str.contains(pattern, regex=True, na=False) for col in df])
row_range = df.loc[(mask.any(axis=1))].index.to_list()
print(df.loc[(mask.any(axis=1))].index.to_list())
[5, 6, 8, 9]
display(df.loc[row_range])
+----+--------------+--------------+-----------------+--------------+--------------+--------------+
| | Unnamed: 0 | Unnamed: 1 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|----+--------------+--------------+-----------------+--------------+--------------+--------------|
| 5 | Jack | 4600 | Temperature (C) | nan | 25 | nan |
| 6 | Tom | 2500 | Strength | 1500 | nan | nan |
| 8 | nan | nan | Temperature (F) | nan | 77 | nan |
| 9 | nan | nan | Comment | nan | Looks OK | Add water |
+----+--------------+--------------+-----------------+--------------+--------------+--------------+
What is the easiest way to solve Goal 2? Basically, I want to find the columns that contain at least one value that matches the regex pattern. The wanted output would be ['Unnamed: 3']. There may be some easy way to solve goals 1 and 2 at the same time. For example:
col_of_interest = 'Unnamed: 3' # <- find this value
col_range = df_original.columns[df_original.columns.to_list().index(col_of_interest): ]
print(col_range)
Index(['Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6'], dtype='object')
target = df_original.loc[row_range, col_range]
display(target)
+----+-----------------+--------------+--------------+--------------+
| | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|----+-----------------+--------------+--------------+--------------|
| 5 | Temperature (C) | nan | 25 | nan |
| 6 | Strength | 1500 | nan | nan |
| 8 | Temperature (F) | nan | 77 | nan |
| 9 | Comment | nan | Looks OK | Add water |
+----+-----------------+--------------+--------------+--------------+
One option is xlsx_cells from pyjanitor. It reads each cell as a single row, which gives you more freedom to manipulate the data and can be a handy alternative for your use case:
# pip install pyjanitor
import pandas as pd
import janitor as jn
Read in data
df = jn.xlsx_cells('test.xlsx', include_blank_cells=False)
df.head()
value internal_value coordinate row column data_type is_date number_format
0 Value Value A2 2 1 s False General
1 High High B2 2 2 s False General
2 Other Other D2 2 4 s False General
3 Other 2 Other 2 E2 2 5 s False General
4 Name Name A3 3 1 s False General
Filter for rows that match the pattern:
bools = df.value.str.startswith(('Temperature', 'Strength', 'Comment'), na = False)
vals = df.loc[bools, ['value', 'row', 'column']]
vals
value row column
16 Temperature (C) 7 4
20 Strength 8 4
24 Temperature (F) 10 4
26 Comment 11 4
Look for values that are on the same row as vals, and are in columns greater than the column in vals:
bools = df.column.gt(vals.column.unique().item()) & df.row.between(vals.row.min(), vals.row.max())
result = df.loc[bools, ['value', 'row', 'column']]
result
value row column
17 25 7 6
21 1500 8 5
25 77 10 6
27 Looks OK 11 6
28 Add water 11 7
Merge vals and result to get the final output
(vals
.drop(columns='column')
.rename(columns={'value':'val'})
.merge(result.drop(columns='column'))
)
val row value
0 Temperature (C) 7 25
1 Strength 8 1500
2 Temperature (F) 10 77
3 Comment 11 Looks OK
4 Comment 11 Add water
Try one of the following 2 options:
Option 1 (assuming there is no non-NaN data below the row with "[Tt]emperature (C)" that we don't want to include)
pattern = r'[Tt]emperature'
idx, col = df_original.stack().str.contains(pattern, regex=True, na=False).idxmax()
res = df_original.loc[idx:, col:].dropna(how='all')
print(res)
Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
5 Temperature (C) NaN 25 NaN
6 Strength 1500 NaN NaN
8 Temperature (F) NaN 77 NaN
9 Comment NaN Looks OK Add water
Explanation
First, we use df.stack to add column names as a level to the index, and get all the data just in one column.
Now, we can apply Series.str.contains to find a match for r'[Tt]emperature'. We chain Series.idxmax to "[r]eturn the row label of the maximum value". I.e. this will be the first True, so we will get back (5, 'Unnamed: 3'), to be stored in idx and col respectively.
Now, we know where to start our selection from the df, namely at index 5 and column Unnamed: 3. If we simply want all the data (to the right, and to bottom) from here on, we can use: df_original.loc[idx:, col:] and finally, drop all remaining rows that have only NaN values.
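The stack and idxmax step can be checked in isolation on a tiny frame. The sketch below is illustrative only: demo, hits and sub are hypothetical names and the data is made up, not taken from the question.

```python
import numpy as np
import pandas as pd

# Small made-up frame with mixed types and NaNs
demo = pd.DataFrame({
    'a': ['x', np.nan, 'y'],
    'b': [np.nan, 'Temperature (C)', 'temperature'],
    'c': [1, 25, np.nan],
})

# One long Series indexed by (row label, column name); non-strings count as no match
hits = demo.stack().str.contains(r'[Tt]emperature', regex=True, na=False)

# idxmax on a boolean Series returns the label of the first True
idx, col = hits.idxmax()
print(idx, col)

# Everything from that cell to the right and downwards
sub = demo.loc[idx:, col:]
print(sub)
```

If nothing matches, idxmax still returns the first label, so a guard such as checking hits.any() beforehand may be worth adding.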
Option 2 (there may be data below the row with "[Tt]emperature (C)" that we don't want to include)
pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'
tmp = df_original.stack().str.contains(pattern, regex=True, na=False)
tmp = tmp[tmp].index
res = df_original.loc[tmp.get_level_values(0), tmp.get_level_values(1)[0]:]
print(res)
Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
5 Temperature (C) NaN 25 NaN
6 Strength 1500 NaN NaN
8 Temperature (F) NaN 77 NaN
9 Comment NaN Looks OK Add water
Explanation
Basically, the procedure here is the same as with option 1, except that we want to retrieve all the index values, rather than just the first one (for "[Tt]emperature (C)"). After tmp[tmp].index, we get tmp as:
MultiIndex([(5, 'Unnamed: 3'),
(6, 'Unnamed: 3'),
(8, 'Unnamed: 3'),
(9, 'Unnamed: 3')],
)
In the next step, we use these values as coordinates for df.loc. I.e. for the index selection, we want all values, so we use index.get_level_values; for the column, we only need the first value (they should all be the same of course: Unnamed: 3).
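Both goals from the question can also be read off a single boolean mask, reducing it along each axis in turn. The following is a sketch on a reduced, hypothetical version of the sheet (df_small), not the full data; casting to str is an assumption made here so that numeric cells do not break the .str accessor.

```python
import numpy as np
import pandas as pd

# Reduced stand-in for the sheet in the question
df_small = pd.DataFrame({
    'Unnamed: 0': ['Name', 'Jack', 'Tom', np.nan],
    'Unnamed: 3': ['Other', 'Temperature (C)', 'Strength', 'Comment'],
    'Unnamed: 4': [25, np.nan, '1500', np.nan],
    'Unnamed: 5': [np.nan, 25, np.nan, 'Looks OK'],
})
pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'

# True wherever a cell matches the pattern
mask = df_small.apply(lambda s: s.astype(str).str.contains(pattern, regex=True))

rows = mask.index[mask.any(axis=1)].tolist()    # Goal 1: rows with a match
cols = mask.columns[mask.any(axis=0)].tolist()  # Goal 2: columns with a match

# Subrange: matching rows, from the key column to the right
target = df_small.loc[rows, cols[0]:]
print(rows, cols)
print(target)
```

Note that astype(str) turns NaN into the string 'nan', which is harmless for this pattern but would matter for a pattern that can match 'nan'.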
Given data as:
| | a | b | c |
|---:|----:|----:|----:|
| 0 | nan | nan | 1 |
| 1 | nan | 2 | nan |
| 2 | 3 | 3 | 3 |
I would like to create some column d containing [1, 2, 3]
There can be an arbitrary amount of columns (though it's going to be <30).
Using
df.isna().apply(lambda x: x.idxmin(), axis=1)
Will give me:
0 c
1 b
2 a
dtype: object
Which seems useful, but I'm drawing a blank on how to access the columns with this, or whether there's a more suitable approach.
Repro:
import io
import pandas as pd
df = pd.read_csv(io.StringIO(',a,b,c\n0,,,1\n1,,2,\n2,3,3,3\n'), index_col=0)
Try this:
df.bfill(axis=1).iloc[:, 0]
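Back-filling along axis=1 pulls each row's first non-null value leftwards into the first column, which is why taking .iloc[:, 0] afterwards returns it. A small check on the question's repro data, assuming the unnamed first CSV column is meant to be the index (index_col=0 is an assumption added here):

```python
import io
import pandas as pd

# Repro data from the question, with the first CSV column used as the index
df = pd.read_csv(io.StringIO(',a,b,c\n0,,,1\n1,,2,\n2,3,3,3\n'), index_col=0)

# Fill each NaN from the next non-null value to its right, then keep the first column
df['d'] = df.bfill(axis=1).iloc[:, 0]
print(df)
```

A row that is entirely NaN would simply yield NaN in d.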
What if you use min on axis=1?
df['min_val'] = df.min(axis=1)
a b c min_val
0 NaN NaN 1.0 1.0
1 NaN 2.0 NaN 2.0
2 3.0 3.0 3.0 3.0
And to get the respective columns:
df['min_val_col'] = df.idxmin(axis=1)
a b c min_val_col
0 NaN NaN 1.0 c
1 NaN 2.0 NaN b
2 3.0 3.0 3.0 a
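On the part of the question about using the idxmin result to access the columns: one possible sketch converts the column labels to integer positions and picks one cell per row with NumPy indexing. It assumes the repro frame with its first CSV column as the index; first_col, col_pos and d are names introduced here.

```python
import io
import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO(',a,b,c\n0,,,1\n1,,2,\n2,3,3,3\n'), index_col=0)

# Label of the first non-null column in each row (expression from the question)
first_col = df.isna().apply(lambda x: x.idxmin(), axis=1)

# Translate labels to positions and select one cell per row
col_pos = df.columns.get_indexer(first_col)
d = df.to_numpy()[np.arange(len(df)), col_pos]

df['d'] = d
print(df)
```

Unlike taking the row minimum, this returns the first non-null value even when the values within a row differ.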
I have a concept of what I need to do, but I can't write code that runs correctly. Please take a look and give some advice.
Step 1. Find the rows that contain values in the second column.
Step 2. For those rows, compare the value in the first column with that of the previous row.
Step 3. Drop the row with the larger first-column value.
|missing | diff |
|--------|------|
| 0 | nan |
| 1 | 60 |
| 1 | nan |
| 0 | nan |
| 0 | nan |
| 1 | 180 |
| 1 | nan |
| 0 | 120 |
E.g. for the rows that have a value in diff (60, 180, 120), I want to compare their missing value with that of the previous row. In the end, the desired dataframe will look like this:
|missing | diff |
|--------|------|
| 0 | nan |
| 1 | nan |
| 0 | nan |
| 0 | nan |
| 0 | 120 |
Update to the question: I tried the code from the answer, but I got the same df as the original df.
import pandas as pd
import numpy as np
data={'missing':[0,1,1,0,0,1,1,0],'diff':[np.nan,60,np.nan,np.nan,np.nan,180,np.nan,120]}
df=pd.DataFrame(data)
df
missing diff
0 0 NaN
1 1 60.0
2 1 NaN
3 0 NaN
4 0 NaN
5 1 180.0
6 1 NaN
7 0 120.0
for ind in df.index:
    if df['diff'][ind]!=np.nan:
        if ind!=0:
            if df['missing'][ind]>df['missing'][ind-1]:
                df=df.drop(ind,0)
            else:
                df=df.drop(ind-1,0)
df
missing diff
0 0 NaN
1 1 60.0
2 1 NaN
3 0 NaN
4 0 NaN
5 1 180.0
6 1 NaN
7 0 120.0
IIUC, you can try:
m = df['diff'].notna()
df = (
pd.concat([
df[df['diff'].isna()],
df[m][df[m.shift(-1).fillna(False)]['missing'].values >
df[m]['missing'].values]
])
)
OUTPUT:
missing diff
1 0 <NA>
3 1 <NA>
4 0 <NA>
5 0 <NA>
7 1 <NA>
8 0 120
This should work:
for ind in df.index:
    # only consider rows whose diff is a number
    if not np.isnan(df['diff'][ind]):
        if ind != 0:
            if df['missing'][ind] > df['missing'][ind-1]:
                df = df.drop(ind, axis=0)
            else:
                df = df.drop(ind-1, axis=0)
This will work:
for ind in df.index:
    # skip rows whose diff is NaN
    if pd.notna(df['diff'][ind]):
        if ind != 0:
            if df['missing'][ind] > df['missing'][ind-1]:
                df = df.drop(ind, axis=0)
            else:
                df = df.drop(ind-1, axis=0)
import pandas as pd #import pandas
from numpy import nan
#define dictionary
data={'missing':[0,1,1,0,0,1,1,0],'diff':[nan,60,nan,nan,nan,180,nan,120]}
#dictionary to dataframe
df=pd.DataFrame(data)
print(df)
#for each row in dataframe
for ind in df.index:
    #only each row whose diff value is a number
    if pd.notna(df['diff'][ind]):
        if ind != 0:
            #compare the first column value with the previous row
            if df['missing'][ind] > df['missing'][ind-1]:
                #drop the row with the larger first column value
                df = df.drop(ind, axis=0)
            else:
                df = df.drop(ind-1, axis=0)
print(df)
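The loops above drop rows one at a time while iterating. A vectorized sketch of the same three steps could look like the following, under the assumption that every comparison uses the original, unmodified rows. The names prev, has_diff, larger, drop_current and drop_previous are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'missing': [0, 1, 1, 0, 0, 1, 1, 0],
                   'diff': [np.nan, 60, np.nan, np.nan, np.nan, 180, np.nan, 120]})

# Value of 'missing' in the previous row (NaN for the first row)
prev = df['missing'].shift()

# Step 1: rows that have a diff value and a previous row to compare with
has_diff = df['diff'].notna() & prev.notna()

# Step 2: is the current 'missing' larger than the previous one?
larger = df['missing'] > prev

# Step 3: drop the current row if it is larger, otherwise drop the previous row
drop_current = has_diff & larger
drop_previous = (has_diff & ~larger).shift(-1, fill_value=False).astype(bool)

result = df[~(drop_current | drop_previous)]
print(result)
```

On the sample data this keeps rows 0, 2, 3, 4 and 7, which matches the desired frame in the question. It can differ from the loop when two consecutive rows both have a diff value, since the loop would then look up a row it has already dropped.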
I have a dataframe which looks like:
PRIO Art Name Value
1 A Alpha 0
1 A Alpha 0
1 A Beta 1
2 A Alpha 3
2 B Theta 2
How can I transpose the dataframe so that every unique name becomes a column, with its corresponding value next to it (duplicate rows should be ignored)?
So in this case:
PRIO Art Alpha Alpha_value Beta Beta_value Theta Theta_value
1 A 1 0 1 1 NaN NaN
2 A 1 3 NaN NaN NaN NaN
2 B NaN NaN NaN NaN 1 2
Here's one way using pivot_table. A few tricky things to keep in mind:
You need to specify both 'PRIO', 'Art' as pivot index
We can also use two aggregation funcs to get it done in a single call
We have to rename the level 0 columns to distinguish them. So you need to swap levels and rename
out = df.pivot_table(index=['PRIO', 'Art'], columns='Name', values='Value',
aggfunc=[lambda x: 1, 'first'])
# get the column names right
d = {'<lambda>':'is_present', 'first':'value'}
out = out.rename(columns=d, level=0)
out.columns = out.swaplevel(1,0, axis=1).columns.map('_'.join)
print(out.reset_index())
PRIO Art Alpha_is_present Beta_is_present Theta_is_present Alpha_value \
0 1 A 1.0 1.0 NaN 0.0
1 2 A 1.0 NaN NaN 3.0
2 2 B NaN NaN 1.0 NaN
Beta_value Theta_value
0 1.0 NaN
1 NaN NaN
2 NaN 2.0
Group by twice: first to pivot Name and add the _value suffix, then group by the same keys and count the unique values. Join the two, dropping the duplicate columns and renaming the others as appropriate.
g=df.groupby([ 'Art','PRIO', 'Name'])['Value'].\
first().unstack().reset_index().add_suffix('_value')
print(g.join(df.groupby(['PRIO', 'Art','Name'])['Value'].\
nunique().unstack('Name').reset_index()).drop(columns=['PRIO_value','Art'])\
.rename(columns={'Art_value':'Art'}))
Name Art Alpha_value Beta_value Theta_value PRIO Alpha Beta Theta
0 A 0.0 1.0 NaN 1 1.0 1.0 NaN
1 A 3.0 NaN NaN 2 1.0 NaN NaN
2 B NaN NaN 2.0 2 NaN NaN 1.0
This is an example of pd.crosstab() and groupby().
df = pd.concat(
    [pd.crosstab([df['PRIO'], df['Art']], df['Name']),
     df.groupby(['PRIO', 'Art', 'Name'])['Value'].sum().unstack().add_suffix('_value')],
    axis=1).reset_index()
df
| | Alpha | Beta | Theta | Alpha_value | Beta_value | Theta_value |
|:---------|--------:|-------:|--------:|--------------:|-------------:|--------------:|
| (1, 'A') | 1 | 1 | 0 | 0 | 1 | nan |
| (2, 'A') | 1 | 0 | 0 | 3 | nan | nan |
| (2, 'B') | 0 | 0 | 1 | nan | nan | 2 |
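Since the question asks for duplicate rows to be ignored, one option is to drop them before counting and pivoting; otherwise crosstab counts the repeated Alpha row twice and sum adds its value twice. A minimal sketch, assuming "duplicate" means fully identical rows; dedup, present and values are names introduced here.

```python
import pandas as pd

# Frame rebuilt from the question
df = pd.DataFrame({'PRIO': [1, 1, 1, 2, 2],
                   'Art': ['A', 'A', 'A', 'A', 'B'],
                   'Name': ['Alpha', 'Alpha', 'Beta', 'Alpha', 'Theta'],
                   'Value': [0, 0, 1, 3, 2]})

# Ignore fully identical rows
dedup = df.drop_duplicates()

# Presence counts per (PRIO, Art) and Name
present = pd.crosstab([dedup['PRIO'], dedup['Art']], dedup['Name'])

# Values per (PRIO, Art) and Name; pivot is safe once duplicates are gone
values = (dedup.pivot(index=['PRIO', 'Art'], columns='Name', values='Value')
               .add_suffix('_value'))

out = pd.concat([present, values], axis=1).reset_index()
print(out)
```

crosstab reports absent names as 0 rather than NaN; if NaN is preferred there, the zeros can be replaced afterwards.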