I'm currently trying to parse Excel files that contain somewhat structured information. The data I am interested in sits in a subrange of an Excel sheet. Basically, the sheet contains key-value pairs where the key is named in a predictable manner (findable with a regex). The keys are all in the same column, and each value is to the right of its key.
The pattern r'[Tt]emperature|[Ss]tren|[Cc]omment' predictably matches the keys. Therefore, if I can find the column where the keys are located and the rows where the keys are present, I can identify the subrange of interest and parse it further.
Goals:
Goal 1: get a list of row indices that match the regex (e.g. [5, 6, 8, 9]).
Goal 2: find which column contains the keys that match the regex (e.g. Unnamed: 3).
When I read in the Excel file using df_original = pd.read_excel(filename, sheet_name=sheet), the DataFrame looks like this:
import numpy as np
import pandas as pd

df_original = pd.DataFrame({'Unnamed: 0':['Value', 'Name', np.nan, 'Mark', 'Molly', 'Jack', 'Tom', 'Lena', np.nan, np.nan],
'Unnamed: 1':['High', 'New York', np.nan, '5000', '5250', '4600', '2500', '4950', np.nan, np.nan],
'Unnamed: 2':[np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'Unnamed: 3':['Other', 125, 127, np.nan, np.nan, 'Temperature (C)', 'Strength', np.nan, 'Temperature (F)', 'Comment'],
'Unnamed: 4':['Other 2', 25, 14.125, np.nan, np.nan, np.nan, '1500', np.nan, np.nan, np.nan],
'Unnamed: 5':[np.nan, np.nan, np.nan, np.nan, np.nan, 25, np.nan, np.nan, 77, 'Looks OK'],
'Unnamed: 6':[np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'Add water'],
})
+----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------+
| | Unnamed: 0 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------|
| 0 | Value | High | nan | Other | Other 2 | nan | nan |
| 1 | Name | New York | nan | 125 | 25 | nan | nan |
| 2 | nan | nan | nan | 127 | 14.125 | nan | nan |
| 3 | Mark | 5000 | nan | nan | nan | nan | nan |
| 4 | Molly | 5250 | nan | nan | nan | nan | nan |
| 5 | Jack | 4600 | nan | Temperature (C) | nan | 25 | nan |
| 6 | Tom | 2500 | nan | Strength | 1500 | nan | nan |
| 7 | Lena | 4950 | nan | nan | nan | nan | nan |
| 8 | nan | nan | nan | Temperature (F) | nan | 77 | nan |
| 9 | nan | nan | nan | Comment | nan | Looks OK | Add water |
+----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------+
This code finds the rows of interest and solves Goal 1.
df = df_original.dropna(how='all', axis=1)
pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'
mask = np.column_stack([df[col].str.contains(pattern, regex=True, na=False) for col in df])
row_range = df.loc[(mask.any(axis=1))].index.to_list()
print(df.loc[(mask.any(axis=1))].index.to_list())
[5, 6, 8, 9]
display(df.loc[row_range])
+----+--------------+--------------+-----------------+--------------+--------------+--------------+
| | Unnamed: 0 | Unnamed: 1 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|----+--------------+--------------+-----------------+--------------+--------------+--------------|
| 5 | Jack | 4600 | Temperature (C) | nan | 25 | nan |
| 6 | Tom | 2500 | Strength | 1500 | nan | nan |
| 8 | nan | nan | Temperature (F) | nan | 77 | nan |
| 9 | nan | nan | Comment | nan | Looks OK | Add water |
+----+--------------+--------------+-----------------+--------------+--------------+--------------+
What is the easiest way to solve Goal 2? Basically, I want to find the columns that contain at least one value matching the regex pattern. The wanted output would be ['Unnamed: 3']. There may be some easy way to solve Goals 1 and 2 at the same time. For example:
col_of_interest = 'Unnamed: 3' # <- find this value
col_range = df_original.columns[df_original.columns.to_list().index(col_of_interest): ]
print(col_range)
Index(['Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6'], dtype='object')
target = df_original.loc[row_range, col_range]
display(target)
+----+-----------------+--------------+--------------+--------------+
| | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|----+-----------------+--------------+--------------+--------------|
| 5 | Temperature (C) | nan | 25 | nan |
| 6 | Strength | 1500 | nan | nan |
| 8 | Temperature (F) | nan | 77 | nan |
| 9 | Comment | nan | Looks OK | Add water |
+----+-----------------+--------------+--------------+--------------+
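For reference, here is a minimal sketch (not from the original post) of how the same boolean mask used for Goal 1 could also cover Goal 2, so that col_of_interest is found rather than hard-coded:
df = df_original.dropna(how='all', axis=1)
pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'
mask = np.column_stack([df[col].str.contains(pattern, regex=True, na=False) for col in df])

row_range = df.index[mask.any(axis=1)].to_list()    # Goal 1: [5, 6, 8, 9]
col_of_interest = df.columns[mask.any(axis=0)][0]   # Goal 2: 'Unnamed: 3'
col_range = df_original.columns[df_original.columns.get_loc(col_of_interest):]
target = df_original.loc[row_range, col_range]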
One option is xlsx_cells from pyjanitor: it reads each cell in as a single row, which gives you more freedom to manipulate the data. For your use case it can be a handy alternative:
# pip install pyjanitor
import pandas as pd
import janitor as jn
Read in the data:
df = jn.xlsx_cells('test.xlsx', include_blank_cells=False)
df.head()
value internal_value coordinate row column data_type is_date number_format
0 Value Value A2 2 1 s False General
1 High High B2 2 2 s False General
2 Other Other D2 2 4 s False General
3 Other 2 Other 2 E2 2 5 s False General
4 Name Name A3 3 1 s False General
Filter for rows that match the pattern:
bools = df.value.str.startswith(('Temperature', 'Strength', 'Comment'), na = False)
vals = df.loc[bools, ['value', 'row', 'column']]
vals
value row column
16 Temperature (C) 7 4
20 Strength 8 4
24 Temperature (F) 10 4
26 Comment 11 4
Look for values that are on the same row as vals, and are in columns greater than the column in vals:
bools = df.column.gt(vals.column.unique().item()) & df.row.between(vals.row.min(), vals.row.max())
result = df.loc[bools, ['value', 'row', 'column']]
result
value row column
17 25 7 6
21 1500 8 5
25 77 10 6
27 Looks OK 11 6
28 Add water 11 7
Merge vals and result to get the final output
(vals
.drop(columns='column')
.rename(columns={'value':'val'})
.merge(result.drop(columns='column'))
)
val row value
0 Temperature (C) 7 25
1 Strength 8 1500
2 Temperature (F) 10 77
3 Comment 11 Looks OK
4 Comment 11 Add water
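If a plain key-to-values mapping is wanted from here, the merged frame can be grouped further (a small follow-up sketch building on the output above):
pairs = (vals
         .drop(columns='column')
         .rename(columns={'value': 'val'})
         .merge(result.drop(columns='column')))
mapping = pairs.groupby('val')['value'].agg(list).to_dict()
# e.g. {'Comment': ['Looks OK', 'Add water'], 'Strength': [1500],
#       'Temperature (C)': [25], 'Temperature (F)': [77]}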
Try one of the following two options:
Option 1 (assuming there is no non-NaN data below the row with "[Tt]emperature (C)" that we don't want to include)
pattern = r'[Tt]emperature'
idx, col = df_original.stack().str.contains(pattern, regex=True, na=False).idxmax()
res = df_original.loc[idx:, col:].dropna(how='all')
print(res)
Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
5 Temperature (C) NaN 25 NaN
6 Strength 1500 NaN NaN
8 Temperature (F) NaN 77 NaN
9 Comment NaN Looks OK Add water
Explanation
First, we use df.stack to add column names as a level to the index, and get all the data just in one column.
Now, we can apply Series.str.contains to find a match for r'[Tt]emperature'. We chain Series.idxmax to "[r]eturn the row label of the maximum value". I.e. this will be the first True, so we will get back (5, 'Unnamed: 3'), to be stored in idx and col respectively.
Now we know where to start our selection of the df, namely at index 5 and column Unnamed: 3. If we simply want all the data (to the right and to the bottom) from there on, we can use df_original.loc[idx:, col:] and finally drop all remaining rows that have only NaN values.
Option 2 (there may be data below the row with "[Tt]emperature (C)" that we don't want to include)
pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'
tmp = df_original.stack().str.contains(pattern, regex=True, na=False)
tmp = tmp[tmp].index
res = df_original.loc[tmp.get_level_values(0), tmp.get_level_values(1)[0]:]
print(res)
Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
5 Temperature (C) NaN 25 NaN
6 Strength 1500 NaN NaN
8 Temperature (F) NaN 77 NaN
9 Comment NaN Looks OK Add water
Explanation
Basically, the procedure here is the same as with Option 1, except that we want to retrieve all the matching index values, rather than just the first one (for "[Tt]emperature (C)"). After tmp[tmp].index, we get tmp as:
MultiIndex([(5, 'Unnamed: 3'),
(6, 'Unnamed: 3'),
(8, 'Unnamed: 3'),
(9, 'Unnamed: 3')],
)
In the next step, we use these values as coordinates for df.loc. I.e. for the index selection, we want all values, so we use index.get_level_values; for the column, we only need the first value (they should all be the same of course: Unnamed: 3).
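If this has to be repeated for many sheets, the same idea can be wrapped in a small helper (a sketch along the lines of Option 2, assuming each sheet has exactly one key column):
def extract_subrange(df, pattern):
    # Locate all cells matching the pattern, then return the block spanning
    # the matching rows and everything from the key column to the right.
    hits = df.stack().str.contains(pattern, regex=True, na=False)
    hits = hits[hits].index                  # MultiIndex of (row, column) pairs
    rows = hits.get_level_values(0)
    key_col = hits.get_level_values(1)[0]    # all hits share the same key column
    return df.loc[rows, key_col:]

res = extract_subrange(df_original, r'[Tt]emperature|[Ss]tren|[Cc]omment')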
Given data as:
| | a | b | c |
|---:|----:|----:|----:|
| 0 | nan | nan | 1 |
| 1 | nan | 2 | nan |
| 2 | 3 | 3 | 3 |
I would like to create some column d containing [1, 2, 3].
There can be an arbitrary number of columns (though it's going to be <30).
Using
df.isna().apply(lambda x: x.idxmin(), axis=1)
gives me:
0 c
1 b
2 a
dtype: object
Which seems useful, but I'm drawing a blank on how to access the columns with this, or whether there's a more suitable approach.
Repro:
import io
import pandas as pd
df = pd.read_csv(io.StringIO(',a,b,c\n0,,,1\n1,,2,\n2,3,3,3\n'), index_col=0)
Try this:
df.fillna(method='bfill', axis=1).iloc[:, 0]
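A quick check against the repro above; note that on recent pandas versions, where fillna(method=...) is deprecated, df.bfill(axis=1) is the equivalent spelling:
import io
import pandas as pd

df = pd.read_csv(io.StringIO(',a,b,c\n0,,,1\n1,,2,\n2,3,3,3\n'), index_col=0)
df['d'] = df.bfill(axis=1).iloc[:, 0]   # first non-NaN value per row
print(df)
#      a    b    c    d
# 0  NaN  NaN  1.0  1.0
# 1  NaN  2.0  NaN  2.0
# 2  3.0  3.0  3.0  3.0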
What if you use min on axis=1?
df['min_val'] = df.min(axis=1)
a b c min_val
0 NaN NaN 1.0 1.0
1 NaN 2.0 NaN 2.0
2 3.0 3.0 3.0 3.0
And to get the respective columns:
df['min_val_col'] = df.idxmin(axis=1)
a b c min_val_col
0 NaN NaN 1.0 c
1 NaN 2.0 NaN b
2 3.0 3.0 3.0 a
I am currently calculating the cash-to-price ratio of about 19,000 companies over the last 10 years. I have this all in one DataFrame with 20+ variables. The problem I'd like to solve is to have the rolling sum restart once a new stock ticker is introduced. The way I coded it below causes the first three entries of a new stock to also sum the Q_Cashflow of the stock just before it in the column.
I coded as follows:
df['K_Cashflow'] = df.Q_Cashflow.rolling(4).sum()
df['cash-to-price'] = df['K_Cashflow']/df['Market Cap']
The output is:
|    | Ticker Symbol | Q_Cashflow | Market Cap | cash-to-price | K_Cashflow |
|----|---------------|-----------:|-----------:|--------------:|-----------:|
| 44 | ADCT.1        |       16.9 |   709.0700 |      0.120157 |       85.2 |
| 45 | ADCT.1        |      102.2 |   718.7700 |      0.310948 |      223.5 |
| 46 | ADCT.1        |      136.6 |  1231.5240 |      0.260815 |      321.2 |
| 47 | AAL           |      456.0 |  3034.1766 |      0.234561 |      711.7 |
| 48 | AAL           |     1173.0 |  2258.1468 |      0.827138 |     1867.8 |
| 49 | AAL           |     1090.0 |  2088.2862 |      1.367437 |     2855.6 |
| 50 | AAL           |     1241.0 |  2597.5755 |      1.524499 |     3960.0 |
Rows 47 to 49 should be NaN for K_Cashflow.
How would I change the first three entries of K_Cashflow to NaN for every different Ticker Symbol?
One way to accomplish this would be to create a rank column based on the ticker and then set the lowest three ranks to NaN. Here's an example:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'ticker': ['a'] * 7 + ['b'] * 10,
'cash_flow': range(17),
})
# Create the rank
df['rank'] = df.groupby('ticker').rank()
# Set the first 3 instances of each ticker to nan
df.loc[df['rank'] < 4, ['cash_flow']] = np.nan
df
ticker cash_flow rank
0 a NaN 1.0
1 a NaN 2.0
2 a NaN 3.0
3 a 3.0 4.0
4 a 4.0 5.0
5 a 5.0 6.0
6 a 6.0 7.0
7 b NaN 1.0
8 b NaN 2.0
9 b NaN 3.0
10 b 10.0 4.0
11 b 11.0 5.0
12 b 12.0 6.0
13 b 13.0 7.0
14 b 14.0 8.0
15 b 15.0 9.0
16 b 16.0 10.0
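Applied back to the original Ticker Symbol / Q_Cashflow frame, here is a sketch of the same idea using groupby together with rolling, which restarts the window at each ticker so the first three rows per ticker come out as NaN automatically (column names taken from the question; rows are assumed to already be sorted by date within each ticker):
df['K_Cashflow'] = (df.groupby('Ticker Symbol')['Q_Cashflow']
                      .rolling(4).sum()
                      .reset_index(level=0, drop=True))   # realign to df's original index
df['cash-to-price'] = df['K_Cashflow'] / df['Market Cap']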
I am using Python and pandas for data analysis. I have sparsely distributed data in different columns, like the following:
| id | col1a | col1b | col2a | col2b | col3a | col3b |
|----|-------|-------|-------|-------|-------|-------|
| 1 | 11 | 12 | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | 21 | 86 | NaN | NaN |
| 3 | 22 | 87 | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | 545 | 32 |
I want to combine this sparsely distributed data from the different columns into tightly packed columns, like the following:
| id | group | cola | colb |
|----|-------|-------|-------|
| 1 | g1 | 11 | 12 |
| 2 | g2 | 21 | 86 |
| 3 | g1 | 22 | 87 |
| 4 | g3 | 545 | 32 |
What I have tried is the following, but I am not able to do it properly:
df['cola']=np.nan
df['colb']=np.nan
df['cola'].fillna(df.col1a,inplace=True)
df['colb'].fillna(df.col1b,inplace=True)
df['cola'].fillna(df.col2a,inplace=True)
df['colb'].fillna(df.col2b,inplace=True)
df['cola'].fillna(df.col3a,inplace=True)
df['colb'].fillna(df.col3b,inplace=True)
But I think there must be a more concise and efficient way of doing this. How can I do it in a better way?
You can use df.stack(), assuming 'id' is your index (otherwise set 'id' as the index first). Then use pd.pivot_table:
df = df.stack().reset_index(name='val',level=1)
df['group'] = 'g' + df['level_1'].str.extract(r'col(\d+)', expand=False)
df['level_1'] = df['level_1'].str.replace(r'col(\d+)', '', regex=True)
df.pivot_table(index=['id','group'],columns='level_1',values='val')
level_1 cola colb
id group
1 g1 11.0 12.0
2 g2 21.0 86.0
3 g1 22.0 87.0
4 g3 545.0 32.0
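To match the requested layout exactly, with id and group as ordinary columns, the pivoted result can be flattened (a small follow-up sketch building on the code above):
out = (df.pivot_table(index=['id', 'group'], columns='level_1', values='val')
         .reset_index()
         .rename_axis(None, axis=1))
#    id group   cola  colb
# 0   1    g1   11.0  12.0
# 1   2    g2   21.0  86.0
# 2   3    g1   22.0  87.0
# 3   4    g3  545.0  32.0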
Another alternative with pd.wide_to_long:
m = pd.wide_to_long(df, ['col'], 'id', 'j', suffix=r'\d+\w+').reset_index()
(m.join(pd.DataFrame(m.pop('j').agg(list).tolist()))
.assign(group=lambda x:x[0].radd('g'))
.set_index(['id','group',1])['col'].unstack().dropna()
.rename_axis(None,axis=1).add_prefix('col').reset_index())
id group cola colb
0 1 g1 11 12
1 2 g2 21 86
2 3 g1 22 87
3 4 g3 545 32
Use:
import re
def fx(s):
s = s.dropna()
group = 'g' + re.search(r'\d+', s.index[0])[0]
return pd.Series([group] + s.tolist(), index=['group', 'cola', 'colb'])
df1 = df.set_index('id').agg(fx, axis=1).reset_index()
# print(df1)
id group cola colb
0 1 g1 11.0 12.0
1 2 g2 21.0 86.0
2 3 g1 22.0 87.0
3 4 g3 545.0 32.0
This would be a way of doing it:
df = pd.DataFrame({'id':[1,2,3,4],
'col1a':[11,np.nan,22,np.nan],
'col1b':[12,np.nan,87,np.nan],
'col2a':[np.nan,21,np.nan,np.nan],
'col2b':[np.nan,86,np.nan,np.nan],
'col3a':[np.nan,np.nan,np.nan,545],
'col3b':[np.nan,np.nan,np.nan,32]})
df_new = df.copy(deep=False)
df_new['group'] = 'g'+df_new['id'].astype(str)
df_new['cola'] = df_new[[x for x in df_new.columns if x.endswith('a')]].sum(axis=1)
df_new['colb'] = df_new[[x for x in df_new.columns if x.endswith('b')]].sum(axis=1)
df_new = df_new[['id','group','cola','colb']]
print(df_new)
Output:
id group cola colb
0 1 g1 11.0 12.0
1 2 g2 21.0 86.0
2 3 g3 22.0 87.0
3 4 g4 545.0 32.0
So if you have more suffixes (colc, cold, cole, colf, etc...) you can create a loop and then use:
suffixes = ['a','b','c','d','e','f']
cols = ['id','group'] + ['col'+x for x in suffixes]
for i in suffixes:
df_new['col'+i] = df_new[[x for x in df_new.columns if x.endswith(i)]].sum(axis=1)
df_new = df_new[cols]
Thanks to @CeliusStingher for providing the code for the dataframe.
One suggestion is to set id as the index and rearrange the columns, with the group numbers extracted from the column names: create a MultiIndex and stack to get the final result:
#set id as index
df = df.set_index("id")
#pull out the numbers from each column
#so that you have (cola,1), (colb,1) ...
#add g to the numbers ... (cola, g1),(colb,g1), ...
#create a MultiIndex
#and reassign to the columns
df.columns = pd.MultiIndex.from_tuples([("".join((first,last)), f"g{second}")
for first, second, last
in df.columns.str.split(r"(\d)")],
names=[None,"group"])
#stack the data
#to get your result
df.stack()
cola colb
id group
1 g1 11.0 12.0
2 g2 21.0 86.0
3 g1 22.0 87.0
4 g3 545.0 32.0
I have a dataframe which looks like:
PRIO Art Name Value
1 A Alpha 0
1 A Alpha 0
1 A Beta 1
2 A Alpha 3
2 B Theta 2
How can I transpose the dataframe so that I have all unique names as columns, with the corresponding values next to them (note that I want to ignore duplicate rows)?
So in this case:
PRIO Art Alpha Alpha_value Beta Beta_value Theta Theta_value
1 A 1 0 1 1 NaN NaN
2 A 1 3 NaN NaN NaN NaN
2 B NaN NaN NaN NaN 1 2
Here's one way using pivot_table. A few tricky things to keep in mind:
You need to specify both 'PRIO' and 'Art' as the pivot index
We can also use two aggregation funcs to get it done in a single call
We have to rename the level-0 columns to distinguish them, so you need to swap levels and rename
out = df.pivot_table(index=['PRIO', 'Art'], columns='Name', values='Value',
aggfunc=[lambda x: 1, 'first'])
# get the column names right
d = {'<lambda>':'is_present', 'first':'value'}
out = out.rename(columns=d, level=0)
out.columns = out.swaplevel(1,0, axis=1).columns.map('_'.join)
print(out.reset_index())
PRIO Art Alpha_is_present Beta_is_present Theta_is_present Alpha_value \
0 1 A 1.0 1.0 NaN 0.0
1 2 A 1.0 NaN NaN 3.0
2 2 B NaN NaN 1.0 NaN
Beta_value Theta_value
0 1.0 NaN
1 NaN NaN
2 NaN 2.0
Group by twice: first to pivot Name and suffix the values with _value; next group by the same keys and count the unique values. Join the two; in the join, drop the duplicate columns and rename the others as appropriate:
g=df.groupby([ 'Art','PRIO', 'Name'])['Value'].\
first().unstack().reset_index().add_suffix('_value')
print(g.join(df.groupby(['PRIO', 'Art','Name'])['Value'].\
nunique().unstack('Name').reset_index()).drop(columns=['PRIO_value','Art'])\
.rename(columns={'Art_value':'Art'}))
Name Art Alpha_value Beta_value Theta_value PRIO Alpha Beta Theta
0 A 0.0 1.0 NaN 1 1.0 1.0 NaN
1 A 3.0 NaN NaN 2 1.0 NaN NaN
2 B NaN NaN 2.0 2 NaN NaN 1.0
This is an example of pd.crosstab() and groupby().
df = pd.concat(
    [pd.crosstab([df['PRIO'], df['Art']], df['Name']),
     df.groupby(['PRIO', 'Art', 'Name'])['Value'].sum().unstack().add_suffix('_value')],
    axis=1).reset_index()
df
| | Alpha | Beta | Theta | Alpha_value | Beta_value | Theta_value |
|:---------|--------:|-------:|--------:|--------------:|-------------:|--------------:|
| (1, 'A') | 1 | 1 | 0 | 0 | 1 | nan |
| (2, 'A') | 1 | 0 | 0 | 3 | nan | nan |
| (2, 'B') | 0 | 0 | 1 | nan | nan | 2 |