Merging multiple datasets in pandas - python

I'm struggling to correctly merge a few datasets in pandas. Let's say I've measured variables A, B, and C, at different times. Sometimes, I've got A and B at the same time, and sometimes not. I have three dataframes, where the dataframe's index is the time of measurement, and a column for the measurement. If I concatenate these dataframes, I get a bunch of NaNs where I have no measurements, maybe something like
idx | A | B | C
-----|-----|-----|----
0 | 1 | NaN | NaN
0 | NaN | 2 | 3
1 | 5 | 3 | NaN
In concatenating, I have non-unique time indices. What I'd like is to sort this by time, and collapse together rows with the same time index. The ideal result here is
idx | A | B | C
-----|-----|-----|----
0 | 1 | 2 | 3
1 | 5 | 3 | NaN
That would be the first scenario. To further complicate things, I may have a column, D, which specifies the location the measurement was taken. I'd thus need to allow this collapsing to keep non-unique indices as long as the entries in D are different for that time. Maybe we have
idx | A | B | C | D
-----|-----|-----|-----|-----
0 | 1 | NaN | NaN | Paris
0 | NaN | 2 | 3 | NYC
1 | 5 | 3 | NaN | NYC
1 | NaN | NaN | 0 | Paris
This dataframe cannot be collapsed any further: conditioned on D, the time indices are already unique, so the information is as collapsed as it can get.
I'm still trying to get my head around the various join / merge / concat operations and how they work, but I'd love a pointer or two.
Thank you!

Assuming that your index is a Timestamp, try to resample it at your desired frequency (e.g. hourly, daily, weekly, etc). You can take the mean measurement in case there are multiple samples observed during the window.
from numpy import nan
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({'A': {Timestamp('2015-01-01 11:30:00'): 1.0,
                         Timestamp('2015-01-01 12:30:00'): nan,
                         Timestamp('2015-01-02 11:15:00'): 5.0,
                         Timestamp('2015-01-02 12:15:00'): nan},
                   'B': {Timestamp('2015-01-01 11:30:00'): nan,
                         Timestamp('2015-01-01 12:30:00'): 2.0,
                         Timestamp('2015-01-02 11:15:00'): 3.0,
                         Timestamp('2015-01-02 12:15:00'): nan},
                   'C': {Timestamp('2015-01-01 11:30:00'): nan,
                         Timestamp('2015-01-01 12:30:00'): 3.0,
                         Timestamp('2015-01-02 11:15:00'): nan,
                         Timestamp('2015-01-02 12:15:00'): 0.0},
                   'D': {Timestamp('2015-01-01 11:30:00'): 'Paris',
                         Timestamp('2015-01-01 12:30:00'): 'NYC',
                         Timestamp('2015-01-02 11:15:00'): 'NYC',
                         Timestamp('2015-01-02 12:15:00'): 'Paris'}})
>>> df
A B C D
2015-01-01 11:30:00 1 NaN NaN Paris
2015-01-01 12:30:00 NaN 2 3 NYC
2015-01-02 11:15:00 5 3 NaN NYC
2015-01-02 12:15:00 NaN NaN 0 Paris
>>> df.resample('1D').mean(numeric_only=True)
A B C
2015-01-01 1 2 3
2015-01-02 5 3 0
To account for the point of observation, you need to include it as a level in the column MultiIndex. An easy way to do this is to group on date and location (column D) and then unstack.
>>> df.reset_index().groupby(['index', 'D']).mean().unstack().resample('1D').mean()
A B C
D NYC Paris NYC Paris NYC Paris
index
2015-01-01 NaN 1 2 NaN 3 NaN
2015-01-02 5 NaN 3 NaN NaN 0
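For completeness, the collapse the question asks about can also be done without changing the time resolution, by grouping on the index (and on D in the second scenario) and taking the first non-NaN value per column. A minimal sketch on a reconstruction of the question's first toy example:
import numpy as np
import pandas as pd

# the concatenated frame from the question, with a non-unique time index
df = pd.DataFrame({'A': [1, np.nan, 5],
                   'B': [np.nan, 2, 3],
                   'C': [np.nan, 3, np.nan]},
                  index=[0, 0, 1])

# scenario 1: collapse rows sharing a time index, keeping the first non-NaN value per column
print(df.groupby(level=0).first())
#      A    B    C
# 0  1.0  2.0  3.0
# 1  5.0  3.0  NaN

# scenario 2 (when a location column D is present): collapse only rows that
# share both the time index and D
# df.groupby([df.index, 'D']).first()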

Related

Pandas find cell location that matches regex

I'm currently trying to parse excel files that contain somewhat structured information. The data I am interested in is in a subrange of an excel sheet. Basically the excel contains key-value pairs where the key is usually named in a predictable manner (found with regex). Keys are in the same column and the value pair is on the right side of the key in the excel sheet.
Regex pattern pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment' predictably matches the keys. Therefore if I can find the column where the keys are located and the rows where the keys are present, I am able to find the subrange of interest and parse it further.
Goals:
Get list of row indices that match regex (e.g. [5, 6, 8, 9])
Find which column contains keys that match regex (e.g. Unnamed: 3)
When I read in the excel using df_original = pd.read_excel(filename, sheet_name=sheet) the dataframe looks like this
import numpy as np
import pandas as pd

df_original = pd.DataFrame({'Unnamed: 0': ['Value', 'Name', np.nan, 'Mark', 'Molly', 'Jack', 'Tom', 'Lena', np.nan, np.nan],
                            'Unnamed: 1': ['High', 'New York', np.nan, '5000', '5250', '4600', '2500', '4950', np.nan, np.nan],
                            'Unnamed: 2': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                            'Unnamed: 3': ['Other', 125, 127, np.nan, np.nan, 'Temperature (C)', 'Strength', np.nan, 'Temperature (F)', 'Comment'],
                            'Unnamed: 4': ['Other 2', 25, 14.125, np.nan, np.nan, np.nan, '1500', np.nan, np.nan, np.nan],
                            'Unnamed: 5': [np.nan, np.nan, np.nan, np.nan, np.nan, 25, np.nan, np.nan, 77, 'Looks OK'],
                            'Unnamed: 6': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'Add water'],
                            })
+----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------+
| | Unnamed: 0 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------|
| 0 | Value | High | nan | Other | Other 2 | nan | nan |
| 1 | Name | New York | nan | 125 | 25 | nan | nan |
| 2 | nan | nan | nan | 127 | 14.125 | nan | nan |
| 3 | Mark | 5000 | nan | nan | nan | nan | nan |
| 4 | Molly | 5250 | nan | nan | nan | nan | nan |
| 5 | Jack | 4600 | nan | Temperature (C) | nan | 25 | nan |
| 6 | Tom | 2500 | nan | Strength | 1500 | nan | nan |
| 7 | Lena | 4950 | nan | nan | nan | nan | nan |
| 8 | nan | nan | nan | Temperature (F) | nan | 77 | nan |
| 9 | nan | nan | nan | Comment | nan | Looks OK | Add water |
+----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------+
This code finds the rows of interest and solves Goal 1.
df = df_original.dropna(how='all', axis=1)
pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'
mask = np.column_stack([df[col].str.contains(pattern, regex=True, na=False) for col in df])
row_range = df.loc[(mask.any(axis=1))].index.to_list()
print(df.loc[(mask.any(axis=1))].index.to_list())
[5, 6, 8, 9]
display(df.loc[row_range])
+----+--------------+--------------+-----------------+--------------+--------------+--------------+
| | Unnamed: 0 | Unnamed: 1 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|----+--------------+--------------+-----------------+--------------+--------------+--------------|
| 5 | Jack | 4600 | Temperature (C) | nan | 25 | nan |
| 6 | Tom | 2500 | Strength | 1500 | nan | nan |
| 8 | nan | nan | Temperature (F) | nan | 77 | nan |
| 9 | nan | nan | Comment | nan | Looks OK | Add water |
+----+--------------+--------------+-----------------+--------------+--------------+--------------+
What is the easiest way to solve Goal 2? Basically I want to find the columns that contain at least one value matching the regex pattern. The wanted output would be ['Unnamed: 3']. There may be some easy way to solve Goals 1 and 2 at the same time. For example:
col_of_interest = 'Unnamed: 3' # <- find this value
col_range = df_original.columns[df_original.columns.to_list().index(col_of_interest): ]
print(col_range)
Index(['Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6'], dtype='object')
target = df_original.loc[row_range, col_range]
display(target)
+----+-----------------+--------------+--------------+--------------+
| | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|----+-----------------+--------------+--------------+--------------|
| 5 | Temperature (C) | nan | 25 | nan |
| 6 | Strength | 1500 | nan | nan |
| 8 | Temperature (F) | nan | 77 | nan |
| 9 | Comment | nan | Looks OK | Add water |
+----+-----------------+--------------+--------------+--------------+
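For Goal 2 by itself, the mask built for Goal 1 can simply be reduced along the rows instead of the columns; a minimal sketch reusing the df and mask variables from the snippet above:
# columns of df that contain at least one cell matching the regex
key_cols = df.columns[mask.any(axis=0)]
print(key_cols.to_list())
# ['Unnamed: 3']

col_of_interest = key_cols[0]  # then select col_range and target exactly as above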
One option is xlsx_cells from pyjanitor: it reads each cell of the sheet as a single row, which gives you more freedom to manipulate the data; for your use case it can be a handy alternative:
# pip install pyjanitor
import pandas as pd
import janitor as jn
Read in data
df = jn.xlsx_cells('test.xlsx', include_blank_cells=False)
df.head()
value internal_value coordinate row column data_type is_date number_format
0 Value Value A2 2 1 s False General
1 High High B2 2 2 s False General
2 Other Other D2 2 4 s False General
3 Other 2 Other 2 E2 2 5 s False General
4 Name Name A3 3 1 s False General
Filter for rows that match the pattern:
bools = df.value.str.startswith(('Temperature', 'Strength', 'Comment'), na = False)
vals = df.loc[bools, ['value', 'row', 'column']]
vals
value row column
16 Temperature (C) 7 4
20 Strength 8 4
24 Temperature (F) 10 4
26 Comment 11 4
Look for values that are on the same row as vals, and are in columns greater than the column in vals:
bools = df.column.gt(vals.column.unique().item()) & df.row.between(vals.row.min(), vals.row.max())
result = df.loc[bools, ['value', 'row', 'column']]
result
value row column
17 25 7 6
21 1500 8 5
25 77 10 6
27 Looks OK 11 6
28 Add water 11 7
Merge vals and result to get the final output
(vals
.drop(columns='column')
.rename(columns={'value':'val'})
.merge(result.drop(columns='column'))
)
val row value
0 Temperature (C) 7 25
1 Strength 8 1500
2 Temperature (F) 10 77
3 Comment 11 Looks OK
4 Comment 11 Add water
Try one of the following 2 options:
Option 1 (assuming there is no non-NaN data below the row with "[Tt]emperature (C)" that we don't want to include)
pattern = r'[Tt]emperature'
idx, col = df_original.stack().str.contains(pattern, regex=True, na=False).idxmax()
res = df_original.loc[idx:, col:].dropna(how='all')
print(res)
Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
5 Temperature (C) NaN 25 NaN
6 Strength 1500 NaN NaN
8 Temperature (F) NaN 77 NaN
9 Comment NaN Looks OK Add water
Explanation
First, we use df.stack to add column names as a level to the index, and get all the data just in one column.
Now, we can apply Series.str.contains to find a match for r'[Tt]emperature'. We chain Series.idxmax to "[r]eturn the row label of the maximum value". I.e. this will be the first True, so we will get back (5, 'Unnamed: 3'), to be stored in idx and col respectively.
Now, we know where to start our selection from the df, namely at index 5 and column Unnamed: 3. If we simply want all the data (to the right, and to bottom) from here on, we can use: df_original.loc[idx:, col:] and finally, drop all remaining rows that have only NaN values.
Option 2 (for when there is data below the row with "[Tt]emperature (C)" that we don't want to include)
pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'
tmp = df_original.stack().str.contains(pattern, regex=True, na=False)
tmp = tmp[tmp].index
res = df_original.loc[tmp.get_level_values(0), tmp.get_level_values(1)[1]:]
print(res)
Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
5 Temperature (C) NaN 25 NaN
6 Strength 1500 NaN NaN
8 Temperature (F) NaN 77 NaN
9 Comment NaN Looks OK Add water
Explanation
Basically, the procedure here is the same as with option 1, except that we want to retrieve all the index values, rather than just the first one (for "[Tt]emperature (C)"). After tmp[tmp].index, we get tmp as:
MultiIndex([(5, 'Unnamed: 3'),
(6, 'Unnamed: 3'),
(8, 'Unnamed: 3'),
(9, 'Unnamed: 3')],
)
In the next step, we use these values as coordinates for df.loc. I.e. for the index selection, we want all values, so we use index.get_level_values; for the column, we only need the first value (they should all be the same of course: Unnamed: 3).

get a column containing the first non-nan value from a group of columns

Given data as:
| | a | b | c |
|---:|----:|----:|----:|
| 0 | nan | nan | 1 |
| 1 | nan | 2 | nan |
| 2 | 3 | 3 | 3 |
I would like to create some column d containing [1, 2, 3]
There can be an arbitrary amount of columns (though it's going to be <30).
Using
df.isna().apply(lambda x: x.idxmin(), axis=1)
Will give me:
0 c
1 b
2 a
dtype: object
Which seems useful, but I'm drawing a blank on how to access the columns with this, or whether there's a more suitable approach.
Repro:
import io
import pandas as pd
df = pd.read_csv(io.StringIO(',a,b,c\n0,,,1\n1,,2,\n2,3,3,3\n'), index_col=0)
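As a pointer on the "how to access the columns" part: the column labels from idxmin can be mapped back to values with integer indexing. A minimal sketch (one possible approach, reusing the repro df above):
import numpy as np

# column label of the first non-NaN value in each row (same idea as the idxmin approach above)
first_col = df.isna().idxmin(axis=1)
# pull the matching value out of each row by position
df['d'] = df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(first_col)]
print(df['d'].tolist())
# [1.0, 2.0, 3.0]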
Try this:
df.bfill(axis=1).iloc[:, 0]
What if you use min on axis=1?
df['min_val'] = df.min(axis=1)
a b c min_val
0 NaN NaN 1.0 1.0
1 NaN 2.0 NaN 2.0
2 3.0 3.0 3.0 3.0
And to get the respective columns:
df['min_val_col'] = df.idxmin(axis=1)
a b c min_val_col
0 NaN NaN 1.0 c
1 NaN 2.0 NaN b
2 3.0 3.0 3.0 a

List in pandas dataframe columns

I have the following pandas dataframe
| A | B |
| :-|:------:|
| 1 | [2,3,4]|
| 2 | np.nan |
| 3 | np.nan |
| 4 | 10 |
I would like to unlist the first row and place those values sequentially in the subsequent rows. The outcome will look like this:
| A | B |
| :-|:------:|
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 10 |
How can I achieve this in a very large dataset with this phenomena occurring in many rows?
If the NaN values serve as "slack" space so that the list elements can slot in (i.e. the lengths match), then you can explode column "B", drop the NaN values with dropna, reset the index, and assign back to "B":
df['B'] = df['B'].explode().dropna().reset_index(drop=True)
Output:
A B
0 1 2
1 2 3
2 3 4
3 4 10
When the number of consecutive NaNs does not match the length of the list, you can instead form groups starting at each non-NaN element and explode each group while keeping its length constant.
I used a slightly different example for clarity (I also assigned to a different column):
df['C'] = (df['B']
           .groupby(df['B'].notna().cumsum())
           .apply(lambda s: s.explode().iloc[:len(s)])
           .values
           )
Output:
A B C
0 1 [2, 3, 4] 2
1 2 NaN 3
2 3 NaN 4
3 4 NaN NaN
4 5 10 10
Used input:
df = pd.DataFrame({'A': range(1, 6),
                   'B': [[2, 3, 4], np.nan, np.nan, np.nan, 10]
                   })

Python explode multiple columns where some rows are NaN

I am trying to apply the pandas explode function to unpack a few columns that are | delimited. Within a row, every column has the same | delimited length (e.g. A will have the same number of |s as B), but rows can have different lengths from one another (e.g. row 1 is length 3 and row 2 is length 2).
There are some rows where there may be a NaN here and there (e.g. in A and C), which is causing the following error: "columns must have matching element counts".
Current data:
A          B                C
1 | 2 | 3  app | ban | cor  NaN
4 | 5      dep | exp        NaN
NaN        for | gep        NaN
Expected output:
A    B    C
1    app  NaN
2    ban  NaN
3    cor  NaN
4    dep  NaN
5    exp  NaN
NaN  for  NaN
NaN  gep  NaN
cols = ['A', 'B', 'C']
for col in cols:
    df_test[col] = df_test[col].str.split('|')
    df_test[col] = df_test[col].fillna({i: [] for i in df_test.index})  # tried replace the NaN with a null list but same error
df_long = df_test.explode(cols)
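The error comes from rows where the per-column lists (or NaNs) end up with different lengths, so explode cannot line the columns up. One way around it is to pad the missing cells to each row's length before exploding; a minimal sketch on a hypothetical df_test rebuilt from the sample data above:
import numpy as np
import pandas as pd

# hypothetical reconstruction of the sample data shown above
df_test = pd.DataFrame({'A': ['1|2|3', '4|5', np.nan],
                        'B': ['app|ban|cor', 'dep|exp', 'for|gep'],
                        'C': [np.nan, np.nan, np.nan]})

cols = ['A', 'B', 'C']
split = df_test[cols].copy()

# split the delimited strings, leaving NaN cells alone for now
for col in cols:
    split[col] = [x.split('|') if isinstance(x, str) else x for x in split[col]]

# each row's target length = length of the longest list in that row
n = split.apply(lambda r: max((len(x) for x in r if isinstance(x, list)), default=1), axis=1)

# pad NaN cells with lists of NaN so every column in a row has the same length
for col in cols:
    split[col] = [x if isinstance(x, list) else [np.nan] * k
                  for x, k in zip(split[col], n)]

df_long = split.explode(cols).reset_index(drop=True)
print(df_long)  # 7 rows, matching the expected output above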

Transpose two or more columns in a dataframe

I have a dataframe which looks like:
PRIO Art Name Value
1 A Alpha 0
1 A Alpha 0
1 A Beta 1
2 A Alpha 3
2 B Theta 2
How can I transpose the dataframe so that each unique name becomes a column, with the corresponding values next to it (note that I want to ignore duplicate rows)?
So in this case:
PRIO Art Alpha Alpha_value Beta Beta_value Theta Theta_value
1 A 1 0 1 1 NaN NaN
2 A 1 3 NaN NaN NaN NaN
2 B NaN NaN NaN NaN 1 2
Here's one way using pivot_table. A few tricky things to keep in mind:
You need to specify both 'PRIO' and 'Art' as the pivot index
We can also use two aggregation funcs to get it done in a single call
We have to rename the level-0 columns to distinguish them, so you need to swap levels and rename
out = df.pivot_table(index=['PRIO', 'Art'], columns='Name', values='Value',
aggfunc=[lambda x: 1, 'first'])
# get the column names right
d = {'<lambda>':'is_present', 'first':'value'}
out = out.rename(columns=d, level=0)
out.columns = out.swaplevel(1,0, axis=1).columns.map('_'.join)
print(out.reset_index())
PRIO Art Alpha_is_present Beta_is_present Theta_is_present Alpha_value \
0 1 A 1.0 1.0 NaN 0.0
1 2 A 1.0 NaN NaN 3.0
2 2 B NaN NaN 1.0 NaN
Beta_value Theta_value
0 1.0 NaN
1 NaN NaN
2 NaN 2.0
Group by twice: first to pivot Name and suffix the columns with _value; next group by the same keys and count the unique values. Join the two; in the join, drop the duplicate columns and rename the others as appropriate.
g = (df.groupby(['Art', 'PRIO', 'Name'])['Value']
     .first().unstack().reset_index().add_suffix('_value'))
print(g.join(df.groupby(['PRIO', 'Art', 'Name'])['Value']
             .nunique().unstack('Name').reset_index())
      .drop(columns=['PRIO_value', 'Art'])
      .rename(columns={'Art_value': 'Art'}))
Name Art Alpha_value Beta_value Theta_value PRIO Alpha Beta Theta
0 A 0.0 1.0 NaN 1 1.0 1.0 NaN
1 A 3.0 NaN NaN 2 1.0 NaN NaN
2 B NaN NaN 2.0 2 NaN NaN 1.0
This is an example of pd.crosstab() and groupby().
df = pd.concat([pd.crosstab([df['PRIO'], df['Art']], df['Name']),
                df.groupby(['PRIO', 'Art', 'Name'])['Value'].sum().unstack().add_suffix('_value')],
               axis=1).reset_index()
df
| | Alpha | Beta | Theta | Alpha_value | Beta_value | Theta_value |
|:---------|--------:|-------:|--------:|--------------:|-------------:|--------------:|
| (1, 'A') | 1 | 1 | 0 | 0 | 1 | nan |
| (2, 'A') | 1 | 0 | 0 | 3 | nan | nan |
| (2, 'B') | 0 | 0 | 1 | nan | nan | 2 |
