I have the following pandas dataframe
| A | B |
| :-|:------:|
| 1 | [2,3,4]|
| 2 | np.nan |
| 3 | np.nan |
| 4 | 10 |
I would like to unlist the first row and place those values sequentially in the subsequent rows. The outcome will look like this:
| A | B |
| :-|:------:|
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 10 |
How can I achieve this in a very large dataset where this phenomenon occurs in many rows?
If the NaN values serve as "slack" space for the list elements to slot into, i.e. if the lengths match, then you can explode column "B", drop the NaN values with dropna, reset the index, and assign back to "B":
df['B'] = df['B'].explode().dropna().reset_index(drop=True)
Output:
A B
0 1 2
1 2 3
2 3 4
3 4 10
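For reference, a minimal reconstruction of the question's input for the snippet above (assuming column B really holds a Python list in the first row and plain scalars/NaN elsewhere):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [[2, 3, 4], np.nan, np.nan, 10]})
df['B'] = df['B'].explode().dropna().reset_index(drop=True)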
If the number of consecutive NaNs does not match the length of the list, you can instead make groups starting at each non-NaN element and explode while keeping the length of each group constant.
I used a slightly different example for clarity (I also assigned to a different column):
df['C'] = (df['B']
           .groupby(df['B'].notna().cumsum())
           .apply(lambda s: s.explode().iloc[:len(s)])
           .values
           )
Output:
A B C
0 1 [2, 3, 4] 2
1 2 NaN 3
2 3 NaN 4
3 4 NaN NaN
4 5 10 10
Used input:
df = pd.DataFrame({'A': range(1,6),
'B': [[2,3,4], np.nan, np.nan, np.nan, 10]
})
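For clarity, the grouping key built by df['B'].notna().cumsum() on this input starts a new group at each non-NaN entry of "B", so the list row and its three NaN slots land in the same group (a quick check):
df['B'].notna().cumsum()
0    1
1    1
2    1
3    1
4    2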
I'm currently trying to parse excel files that contain somewhat structured information. The data I am interested in is in a subrange of an excel sheet. Basically the excel contains key-value pairs where the key is usually named in a predictable manner (found with regex). Keys are in the same column and the value pair is on the right side of the key in the excel sheet.
The regex pattern pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment' predictably matches the keys. Therefore, if I can find the column where the keys are located and the rows where the keys are present, I can find the subrange of interest and parse it further.
Goals:
Get list of row indices that match regex (e.g. [5, 6, 8, 9])
Find which column contains keys that match regex (e.g. Unnamed: 3)
When I read in the excel using df_original = pd.read_excel(filename, sheet_name=sheet) the dataframe looks like this
df_original = pd.DataFrame({'Unnamed: 0':['Value', 'Name', np.nan, 'Mark', 'Molly', 'Jack', 'Tom', 'Lena', np.nan, np.nan],
'Unnamed: 1':['High', 'New York', np.nan, '5000', '5250', '4600', '2500', '4950', np.nan, np.nan],
'Unnamed: 2':[np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'Unnamed: 3':['Other', 125, 127, np.nan, np.nan, 'Temperature (C)', 'Strength', np.nan, 'Temperature (F)', 'Comment'],
'Unnamed: 4':['Other 2', 25, 14.125, np.nan, np.nan, np.nan, '1500', np.nan, np.nan, np.nan],
'Unnamed: 5':[np.nan, np.nan, np.nan, np.nan, np.nan, 25, np.nan, np.nan, 77, 'Looks OK'],
'Unnamed: 6':[np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'Add water'],
})
+----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------+
| | Unnamed: 0 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------|
| 0 | Value | High | nan | Other | Other 2 | nan | nan |
| 1 | Name | New York | nan | 125 | 25 | nan | nan |
| 2 | nan | nan | nan | 127 | 14.125 | nan | nan |
| 3 | Mark | 5000 | nan | nan | nan | nan | nan |
| 4 | Molly | 5250 | nan | nan | nan | nan | nan |
| 5 | Jack | 4600 | nan | Temperature (C) | nan | 25 | nan |
| 6 | Tom | 2500 | nan | Strength | 1500 | nan | nan |
| 7 | Lena | 4950 | nan | nan | nan | nan | nan |
| 8 | nan | nan | nan | Temperature (F) | nan | 77 | nan |
| 9 | nan | nan | nan | Comment | nan | Looks OK | Add water |
+----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------+
This code finds the rows of interest and solves Goal 1.
df = df_original.dropna(how='all', axis=1)
pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'
mask = np.column_stack([df[col].str.contains(pattern, regex=True, na=False) for col in df])
row_range = df.loc[(mask.any(axis=1))].index.to_list()
print(df.loc[(mask.any(axis=1))].index.to_list())
[5, 6, 8, 9]
display(df.loc[row_range])
+----+--------------+--------------+-----------------+--------------+--------------+--------------+
| | Unnamed: 0 | Unnamed: 1 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|----+--------------+--------------+-----------------+--------------+--------------+--------------|
| 5 | Jack | 4600 | Temperature (C) | nan | 25 | nan |
| 6 | Tom | 2500 | Strength | 1500 | nan | nan |
| 8 | nan | nan | Temperature (F) | nan | 77 | nan |
| 9 | nan | nan | Comment | nan | Looks OK | Add water |
+----+--------------+--------------+-----------------+--------------+--------------+--------------+
What is the easiest way to solve Goal 2? Basically I want to find the columns that contain at least one value matching the regex pattern; the wanted output would be ['Unnamed: 3']. There may be an easy way to solve Goals 1 and 2 at the same time. For example:
col_of_interest = 'Unnamed: 3' # <- find this value
col_range = df_original.columns[df_original.columns.to_list().index(col_of_interest): ]
print(col_range)
Index(['Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6'], dtype='object')
target = df_original.loc[row_range, col_range]
display(target)
+----+-----------------+--------------+--------------+--------------+
| | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|----+-----------------+--------------+--------------+--------------|
| 5 | Temperature (C) | nan | 25 | nan |
| 6 | Strength | 1500 | nan | nan |
| 8 | Temperature (F) | nan | 77 | nan |
| 9 | Comment | nan | Looks OK | Add water |
+----+-----------------+--------------+--------------+--------------+
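For what it's worth, a minimal sketch of Goal 2 that reuses the mask already computed in the Goal 1 snippet above (so it assumes df and mask are still in scope): mask has one column per column of df, so reducing it along the rows flags the columns containing at least one match.
cols_of_interest = df.columns[mask.any(axis=0)].to_list()
print(cols_of_interest)
['Unnamed: 3']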
One option is xlsx_cells from pyjanitor: it reads each cell in as a single row, which gives you more freedom to manipulate the data and can be a handy alternative for your use case:
# pip install pyjanitor
import pandas as pd
import janitor as jn
Read in data
df = jn.xlsx_cells('test.xlsx', include_blank_cells=False)
df.head()
value internal_value coordinate row column data_type is_date number_format
0 Value Value A2 2 1 s False General
1 High High B2 2 2 s False General
2 Other Other D2 2 4 s False General
3 Other 2 Other 2 E2 2 5 s False General
4 Name Name A3 3 1 s False General
Filter for rows that match the pattern:
bools = df.value.str.startswith(('Temperature', 'Strength', 'Comment'), na = False)
vals = df.loc[bools, ['value', 'row', 'column']]
vals
value row column
16 Temperature (C) 7 4
20 Strength 8 4
24 Temperature (F) 10 4
26 Comment 11 4
Look for values that are on the same row as vals, and are in columns greater than the column in vals:
bools = df.column.gt(vals.column.unique().item()) & df.row.between(vals.row.min(), vals.row.max())
result = df.loc[bools, ['value', 'row', 'column']]
result
value row column
17 25 7 6
21 1500 8 5
25 77 10 6
27 Looks OK 11 6
28 Add water 11 7
Merge vals and result to get the final output
(vals
 .drop(columns='column')
 .rename(columns={'value':'val'})
 .merge(result.drop(columns='column'))
)
val row value
0 Temperature (C) 7 25
1 Strength 8 1500
2 Temperature (F) 10 77
3 Comment 11 Looks OK
4 Comment 11 Add water
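If a plain key-to-values mapping is more convenient downstream, one possible follow-up (a sketch; the name out for the merged frame above is hypothetical):
out = (vals
       .drop(columns='column')
       .rename(columns={'value': 'val'})
       .merge(result.drop(columns='column')))
# collect every value found to the right of each key
key_values = out.groupby('val')['value'].agg(list).to_dict()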
Try one of the following 2 options:
Option 1 (assuming there is no non-NaN data that we don't want to include below the "[Tt]emperature (C)" row)
pattern = r'[Tt]emperature'
idx, col = df_original.stack().str.contains(pattern, regex=True, na=False).idxmax()
res = df_original.loc[idx:, col:].dropna(how='all')
print(res)
Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
5 Temperature (C) NaN 25 NaN
6 Strength 1500 NaN NaN
8 Temperature (F) NaN 77 NaN
9 Comment NaN Looks OK Add water
Explanation
First, we use df.stack to add column names as a level to the index, and get all the data just in one column.
Now, we can apply Series.str.contains to find a match for r'[Tt]emperature'. We chain Series.idxmax to "[r]eturn the row label of the maximum value". I.e. this will be the first True, so we will get back (5, 'Unnamed: 3'), to be stored in idx and col respectively.
Now, we know where to start our selection from the df, namely at index 5 and column Unnamed: 3. If we simply want all the data (to the right, and to bottom) from here on, we can use: df_original.loc[idx:, col:] and finally, drop all remaining rows that have only NaN values.
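To see what the stack step produces (a quick look, using the df_original from the question), the non-NaN cells become a single Series indexed by (row, column) pairs:
df_original.stack().head()
0  Unnamed: 0      Value
   Unnamed: 1       High
   Unnamed: 3      Other
   Unnamed: 4    Other 2
1  Unnamed: 0       Name
dtype: object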
Option 2 (there may be data below the "[Tt]emperature (C)" row that we don't want to include)
pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'
tmp = df_original.stack().str.contains(pattern, regex=True, na=False)
tmp = tmp[tmp].index
res = df_original.loc[tmp.get_level_values(0), tmp.get_level_values(1)[1]:]
print(res)
Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
5 Temperature (C) NaN 25 NaN
6 Strength 1500 NaN NaN
8 Temperature (F) NaN 77 NaN
9 Comment NaN Looks OK Add water
Explanation
Basically, the procedure here is the same as with option 1, except that we want to retrieve all the index values, rather than just the first one (for "[Tt]emperature (C)"). After tmp[tmp].index, we get tmp as:
MultiIndex([(5, 'Unnamed: 3'),
(6, 'Unnamed: 3'),
(8, 'Unnamed: 3'),
(9, 'Unnamed: 3')],
)
In the next step, we use these values as coordinates for df.loc. I.e. for the index selection, we want all values, so we use index.get_level_values; for the column, we only need the first value (they should all be the same of course: Unnamed: 3).
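Following the same idea, Goals 1 and 2 from the question fall straight out of this MultiIndex (a small sketch, assuming tmp is the MultiIndex shown above):
row_range = tmp.get_level_values(0).to_list()   # [5, 6, 8, 9]
key_col = tmp.get_level_values(1)[0]            # 'Unnamed: 3'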
I am trying to run a simple calculation over the values of each row from within a group inside of a dataframe, but I'm having trouble with the syntax. I think I'm specifically getting confused about what data object I should return, i.e. DataFrame vs. Series, etc.
For context, I have a bunch of stock values for each product I am tracking and I want to estimate the number of sales via a custom function which essentially does the following:
# Because stock can go up and down, I'm looking to record the difference
# when the stock is less than the previous stock number from the previous row.
# How do I access each row of the dataframe and then return the series I need?
def get_stock_sold(x):
    # Written in pseudo
    stock_sold = previous_stock_no - current_stock_no if current_stock_no < previous_stock_no else 0
    return pd.Series(stock_sold)
I then have the following dataframe:
# 'Order' is a date in the real dataset.
data = {
'id' : ['1', '1', '1', '2', '2', '2'],
'order' : [1, 2, 3, 1, 2, 3],
'current_stock' : [100, 150, 90, 50, 48, 30]
}
df = pd.DataFrame(data)
df = df.sort_values(by=['id', 'order'])
df['previous_stock'] = df.groupby('id')['current_stock'].shift(1)
I'd like to create a new column (stock_sold) and apply the logic from above to each row within the grouped dataframe object:
df['stock_sold'] = df.groupby('id').apply(get_stock_sold)
Desired output would look as follows:
| id | order | current_stock | previous_stock | stock_sold |
|----|-------|---------------|----------------|------------|
| 1 | 1 | 100 | NaN | 0 |
| | 2 | 150 | 100.0 | 0 |
| | 3 | 90 | 150.0 | 60 |
| 2 | 1 | 50 | NaN | 0 |
| | 2 | 48 | 50.0 | 2 |
| | 3 | 30 | 48 | 18 |
Try:
df["previous_stock"] = df.groupby("id")["current_stock"].shift()
df["stock_sold"] = np.where(
df["current_stock"] > df["previous_stock"].fillna(0),
0,
df["previous_stock"] - df["current_stock"],
)
print(df)
Prints:
id order current_stock previous_stock stock_sold
0 1 1 100 NaN 0.0
1 1 2 150 100.0 0.0
2 1 3 90 150.0 60.0
3 2 1 50 NaN 0.0
4 2 2 48 50.0 2.0
5 2 3 30 48.0 18.0
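For what it's worth, the same rule can also be written without np.where by negating the per-group difference and clipping the increases to zero (a sketch, not part of the answer above; it assumes the frame is already sorted by id and order):
df["stock_sold"] = (-df.groupby("id")["current_stock"].diff()).clip(lower=0).fillna(0)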
I have a concept of what I need to do, but I can't write the right code to run. Please take a look and give some advice.
Step 1: find the rows that contain values in the second column.
Step 2: for those rows, compare the value in the first column with that of the previous row.
Step 3: drop the row with the larger first-column value.
|missing | diff |
|--------|------|
| 0 | nan |
| 1 | 60 |
| 1 | nan |
| 0 | nan |
| 0 | nan |
| 1 | 180 |
| 1 | nan |
| 0 | 120 |
E.g. I want to compare the missing values of the rows that have values in diff ([120, 180, 60]) with their previous rows. In the end, the desired dataframe will look like:
|missing | diff |
|--------|------|
| 0 | nan |
| 1 | nan |
| 0 | nan |
| 0 | nan |
| 0 | 120 |
Update to the question: following the answer, I got the same df as the original df.
import pandas as pd
import numpy as np
data={'missing':[0,1,1,0,0,1,1,0],'diff':[np.nan,60,np.nan,np.nan,np.nan,180,np.nan,120]}
df=pd.DataFrame(data)
df
missing diff
0 0 NaN
1 1 60.0
2 1 NaN
3 0 NaN
4 0 NaN
5 1 180.0
6 1 NaN
7 0 120.0
for ind in df.index:
    if df['diff'][ind]!=np.nan:
        if ind!=0:
            if df['missing'][ind]>df['missing'][ind-1]:
                df=df.drop(ind,0)
            else:
                df=df.drop(ind-1,0)
df
missing diff
0 0 NaN
1 1 60.0
2 1 NaN
3 0 NaN
4 0 NaN
5 1 180.0
6 1 NaN
7 0 120.0
IIUC, you can try:
m = df['diff'].notna()
df = (
    pd.concat([
        df[df['diff'].isna()],
        df[m][df[m.shift(-1).fillna(False)]['missing'].values >
              df[m]['missing'].values]
    ])
)
OUTPUT:
missing diff
1 0 <NA>
3 1 <NA>
4 0 <NA>
5 0 <NA>
7 1 <NA>
8 0 120
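The same drop rule can also be spelled out with shift instead of the concat above (a sketch against the df defined in the update; not the answer's approach):
m = df['diff'].notna()
bigger = df['missing'] > df['missing'].shift()            # current row's missing > previous row's
drop_current = m & bigger & (df.index != 0)
drop_previous = (m & ~bigger & (df.index != 0)).shift(-1, fill_value=False)
out = df[~(drop_current | drop_previous)]
print(out)
   missing   diff
0        0    NaN
2        1    NaN
3        0    NaN
4        0    NaN
7        0  120.0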
This will work for sure
for ind in df.index:
    if np.isnan(df['diff'][ind])==False:
        if ind!=0:
            if df['missing'][ind]>df['missing'][ind-1]:
                df=df.drop(ind,0)
            else:
                df=df.drop(ind-1,0)
This will work
for ind in df.index:
    if pd.notna(df['diff'][ind]):   # comparing against the string "nan" would not detect real NaN values
        if ind!=0:
            if df['missing'][ind]>df['missing'][ind-1]:
                df=df.drop(ind,0)
            else:
                df=df.drop(ind-1,0)
import pandas as pd  # import pandas
import numpy as np   # needed for np.nan in the dictionary below

# define dictionary
data={'missing':[0,1,1,0,0,1,1,0],'diff':[np.nan,60,np.nan,np.nan,np.nan,180,np.nan,120]}
# dictionary to dataframe
df=pd.DataFrame(data)
print(df)
# for each row in dataframe
for ind in df.index:
    # only rows whose diff value is a number
    if pd.notna(df['diff'][ind]):
        if ind!=0:
            # find the rows that contain values in the second column and compare with the previous row
            if df['missing'][ind]>df['missing'][ind-1]:
                # drop the row with the larger first column value
                df=df.drop(ind,0)
            else:
                df=df.drop(ind-1,0)
print(df)
I am looking for an efficient and elegant way in Pandas to remove "duplicate" rows in a DataFrame that have exactly the same value set but in different columns.
I am ideally looking for a vectorized way to do this, as I can already identify very inefficient ways using the pandas.DataFrame.iterrows() method.
Say my DataFrame is:
| source | target |
|--------|--------|
| 1      | 2      |
| 2      | 1      |
| 4      | 3      |
| 2      | 7      |
| 3      | 4      |
I want it to become:
| source | target |
|--------|--------|
| 1      | 2      |
| 4      | 3      |
| 2      | 7      |
df = df[~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()]
source target
0 1 2
2 4 3
3 2 7
explanation:
np.sort(df.values, axis=1) sorts the values within each row:
array([[1, 2],
[1, 2],
[3, 4],
[2, 7],
[3, 4]], dtype=int64)
Then we build a dataframe from it and mark the non-duplicated rows by negating duplicated with the ~ prefix:
~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()
0 True
1 False
2 True
3 True
4 False
dtype: bool
Using this as a mask gives the final output:
source target
0 1 2
2 4 3
3 2 7
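For reference, a self-contained reproduction of the above (assuming plain integer columns); passing the original index to the helper frame keeps the boolean mask aligned even when df does not have a default RangeIndex:
import numpy as np
import pandas as pd

df = pd.DataFrame({'source': [1, 2, 4, 2, 3],
                   'target': [2, 1, 3, 7, 4]})
mask = ~pd.DataFrame(np.sort(df.to_numpy(), axis=1), index=df.index).duplicated()
print(df[mask])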
I have a DataFrame (df) that looks like the following:
+----------+----+
| dd_mm_yy | id |
+----------+----+
| 01-03-17 | A |
| 01-03-17 | B |
| 01-03-17 | C |
| 01-05-17 | B |
| 01-05-17 | D |
| 01-07-17 | A |
| 01-07-17 | D |
| 01-08-17 | C |
| 01-09-17 | B |
| 01-09-17 | B |
+----------+----+
This is the end result I would like to compute:
+----------+----+-----------+
| dd_mm_yy | id | cum_count |
+----------+----+-----------+
| 01-03-17 | A | 1 |
| 01-03-17 | B | 1 |
| 01-03-17 | C | 1 |
| 01-05-17 | B | 2 |
| 01-05-17 | D | 1 |
| 01-07-17 | A | 2 |
| 01-07-17 | D | 2 |
| 01-08-17 | C | 1 |
| 01-09-17 | B | 2 |
| 01-09-17 | B | 3 |
+----------+----+-----------+
Logic
Calculate the cumulative occurrences of values in id, but only within a specified time window, for example 4 months; i.e. every 5th month the counter resets to one.
To get the cumulative occurrences we can use df.groupby('id').cumcount() + 1
Focusing on id = B, we see that the 2nd occurrence of B is after 2 months, so cum_count = 2. The next occurrence of B is at 01-09-17; looking back 4 months we only find one other occurrence, so cum_count = 2, etc.
My approach is to call a helper function from df.groupby('id').transform. I feel this is more complicated and slower than it could be, but it seems to work.
# test data
date id cum_count_desired
2017-03-01 A 1
2017-03-01 B 1
2017-03-01 C 1
2017-05-01 B 2
2017-05-01 D 1
2017-07-01 A 2
2017-07-01 D 2
2017-08-01 C 1
2017-09-01 B 2
2017-09-01 B 3
# preprocessing
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Encode the ID strings to numbers to have a column
# to work with after grouping by ID
df['id_code'] = pd.factorize(df['id'])[0]
# solution
def cumcounter(x):
    y = [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index]
    gr = x.groupby('date')
    adjust = gr.rank(method='first') - gr.size()
    y = y + adjust.to_numpy()  # element-wise addition, keeping the result the same length as the group
    return y
df['cum_count'] = df.groupby('id')['id_code'].transform(cumcounter)
# output
df[['id', 'id_code', 'cum_count_desired', 'cum_count']]
id id_code cum_count_desired cum_count
date
2017-03-01 A 0 1 1
2017-03-01 B 1 1 1
2017-03-01 C 2 1 1
2017-05-01 B 1 2 2
2017-05-01 D 3 1 1
2017-07-01 A 0 2 2
2017-07-01 D 3 2 2
2017-08-01 C 2 1 1
2017-09-01 B 1 2 2
2017-09-01 B 1 3 3
The need for adjust
If the same ID occurs multiple times on the same day, the slicing approach that I use will overcount each of the same-day IDs, because the date-based slice immediately grabs all of the same-day values when the list comprehension encounters the date on which multiple IDs show up. Fix:
Group the current DataFrame by date.
Rank each row in each date group.
Subtract from these ranks the total number of rows in each date group. This produces a date-indexed Series of ascending negative integers, ending at 0.
Add these non-positive integer adjustments to y.
This only affects one row in the given test data -- the second-last row, because B appears twice on the same day.
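To make the adjustment concrete for id B in the test data, here is a hand-worked sketch that rebuilds just that group's slice:
import pandas as pd

# id_code values for id B, indexed by date (note the duplicated 2017-09-01)
x = pd.Series([1, 1, 1, 1],
              index=pd.DatetimeIndex(['2017-03-01', '2017-05-01',
                                      '2017-09-01', '2017-09-01'], name='date'))
y = [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index]
print(y)                            # [1, 2, 3, 3] -- both same-day rows get over-counted
gr = x.groupby('date')
adjust = gr.rank(method='first') - gr.size()
print(adjust.to_list())             # [0.0, 0.0, -1.0, 0.0]
print(list(y + adjust.to_numpy()))  # [1.0, 2.0, 2.0, 3.0], the desired counts for B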
Including or excluding the left endpoint of the time interval
To count rows as old as or newer than 4 calendar months ago, i.e., to include the left endpoint of the 4-month time interval, leave this line unchanged:
y = [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index]
To count rows strictly newer than 4 calendar months ago, i.e., to exclude the left endpoint of the 4-month time interval, use this instead:
y = [x.loc[d - pd.DateOffset(months=4, days=-1):d].count() for d in x.index]
You can extend the groupby with a grouper:
df['cum_count'] = df.groupby(['id', pd.Grouper(freq='4M', key='date')]).cumcount()
Out[48]:
date id cum_count
0 2017-03-01 A 0
1 2017-03-01 B 0
2 2017-03-01 C 0
3 2017-05-01 B 0
4 2017-05-01 D 0
5 2017-07-01 A 0
6 2017-07-01 D 1
7 2017-08-01 C 0
8 2017-09-01 B 0
9 2017-09-01 B 1
We can make use of .apply row-wise to work on a sliced df as well. The slice is based on relativedelta from dateutil.
from dateutil.relativedelta import relativedelta
import pandas as pd

def get_cum_sum(slice, row):
    if slice.shape[0] == 0:
        return 1
    return slice[slice['id'] == row.id].shape[0]

d={'dd_mm_yy':['01-03-17','01-03-17','01-03-17','01-05-17','01-05-17','01-07-17','01-07-17','01-08-17','01-09-17','01-09-17'],'id':['A','B','C','B','D','A','D','C','B','B']}
df=pd.DataFrame(data=d)
df['dd_mm_yy'] = pd.to_datetime(df['dd_mm_yy'], format='%d-%m-%y')
df['cum_sum'] = df.apply(lambda current_row: get_cum_sum(df[(df.index <= current_row.name) & (df.dd_mm_yy >= (current_row.dd_mm_yy - relativedelta(months=+4)))], current_row), axis=1)
>>> df
dd_mm_yy id cum_sum
0 2017-03-01 A 1
1 2017-03-01 B 1
2 2017-03-01 C 1
3 2017-05-01 B 2
4 2017-05-01 D 1
5 2017-07-01 A 2
6 2017-07-01 D 2
7 2017-08-01 C 1
8 2017-09-01 B 2
9 2017-09-01 B 3
I was also thinking about whether it is feasible to use .rolling, but months are not a fixed period, so it might not work.
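For completeness, a rough sketch of that .rolling idea, approximating the 4-month lookback with a fixed 124-day window (so it is only an approximation, not a calendar-exact equivalent):
import pandas as pd

d = {'dd_mm_yy': ['01-03-17','01-03-17','01-03-17','01-05-17','01-05-17',
                  '01-07-17','01-07-17','01-08-17','01-09-17','01-09-17'],
     'id': ['A','B','C','B','D','A','D','C','B','B']}
df = pd.DataFrame(d)
df['dd_mm_yy'] = pd.to_datetime(df['dd_mm_yy'], format='%d-%m-%y')

# Sort by id and date so the group-by-group output of rolling lines up
# positionally with the frame when assigning back via .to_numpy().
df = df.sort_values(['id', 'dd_mm_yy']).reset_index(drop=True)
df['cum_count'] = (df.assign(one=1)
                     .set_index('dd_mm_yy')
                     .groupby('id')['one']
                     .rolling('124D')   # stand-in for "4 months"
                     .count()
                     .to_numpy())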