How to merge rows in pandas dataframe [duplicate] - python

This question already has answers here:
pandas group by and find first non null value for all columns
(3 answers)
Merge each group's rows into one row
(1 answer)
Closed 1 year ago.
I have a dataframe that looks like this:
productID  units sold  units in inventory
101        32          NaN
102        45          NaN
103        15          NaN
104        27          NaN
101        NaN         18
102        NaN         12
103        NaN         30
104        NaN         23
As you can see, productID contains duplicates, and each row has data in one of the two data columns but NaN in the other.
Is there a way to merge the rows, so the dataframe looks like this?
productID  units sold  units in inventory
101        32          18
102        45          12
103        15          30
104        27          23

Try groupby.first:
>>> df.groupby('productID', as_index=False).first()
productID units sold units in inventory
0 101 32.0 18.0
1 102 45.0 12.0
2 103 15.0 30.0
3 104 27.0 23.0
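GroupBy.first returns the first non-null value per column within each group, which is why the NaNs disappear here. The counts come back as floats because the original columns contained NaN; if you want integers again, a small optional cast (using the column names from the question) does it:
result = df.groupby('productID', as_index=False).first()
result = result.astype({'units sold': 'int64', 'units in inventory': 'int64'})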

Related

Select data, resample and join horizontally in Python

I have two dataframes df1 and df2.
df1:
Id Date Remark
0 28 2010-04-08 xx
1 29 2010-10-10 yy
2 30 2012-12-03 zz
3 31 2010-03-16 aa
df2:
Id Timestamp Value Site
0 28 2010-04-08 13:20:15.120 125.0 93
1 28 2010-04-08 13:20:16.020 120.0 94
2 28 2010-04-08 13:20:18.020 135.0 95
3 28 2010-04-08 13:20:18.360 140.0 96
...
1000 29 2010-06-15 05:04:15.120 16.0 101
1001 29 2010-06-15 05:05:16.320 14.0 101
...
I would like to select from df2 all the Value data in the 10 days up to and including the Date in df1, for the same Id. For example, for Id 28 the Date is 2010-04-08, so select Value where Timestamp is between 2010-03-30 00:00:00 and 2010-04-08 23:59:59 (inclusive).
Then I want to resample the Value data with forward fill (ffill) and backward fill (bfill) at 1-minute frequency, so that there are exactly 10 x 24 x 60 = 14400 values for each Id.
Lastly, I'd like to rearrange the dataframe horizontally with transpose.
Expected output looks like this:
Id Date value1 value2 ... value14399 value14400 Remark
0 28 2010-04-08 125.0 125.0 ... ... xx (value1, value2, and the following values before "2010-04-08 13:20:15.120" are all 125.0 as a result of the backward fill, since the first value for Id 28 is 125.0)
1 29 2010-10-10 16.0 16.0 ... yy
...
I am not sure what the best way to approach this problem is, since I'm adding "another time series dimension" to the dataframe. Any idea is appreciated.
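A minimal sketch of one possible approach (build_row is my own name, not from the question; it assumes Date and Timestamp are already parsed as datetime64 columns, and pandas >= 1.4 for date_range(..., inclusive='left')):
import pandas as pd

def build_row(row, df2):
    # Collect this Id's Values in the 10-day window ending on Date,
    # resampled onto a 1-minute grid of exactly 10 * 24 * 60 = 14400 points.
    end = row['Date'] + pd.Timedelta(days=1)   # end of the Date day, exclusive
    start = end - pd.Timedelta(days=10)
    mask = (df2['Id'] == row['Id']) & (df2['Timestamp'] >= start) & (df2['Timestamp'] < end)
    s = df2.loc[mask].set_index('Timestamp')['Value']
    grid = pd.date_range(start, end, freq='1min', inclusive='left')  # 14400 minutes
    s = s.resample('1min').mean().reindex(grid).ffill().bfill()
    s.index = [f'value{i}' for i in range(1, len(s) + 1)]
    return s

wide = df1.join(df1.apply(build_row, axis=1, df2=df2))  # Id, Date, Remark + value1..value14400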

How to add zero values in beginning and end of Pandas?

I have the following dataframe
Inbound  Value
1        NaN
2        NaN
3        NaN
4        ...
5        ...
19       NaN
20       130
21       130
22       140
23       140
24       170
25       170
25       170
26       ...
27       210
28       NaN
29       NaN
30       ...
..       ...
131      NaN
I would like to drop most of the NaN values, but set the first 11 and the last 11 to zero instead.
I know that data = data.dropna() drops all the NaN values, but I want the behavior described above.
Use numpy's r_ to build the set of positional indexes to set to a value, then drop the remaining NaN. (Note that pd.np was removed in pandas 1.0, so import numpy directly, and that dropna returns a new frame, so assign the result back.)
import numpy as np

df.iloc[np.r_[0:11, -11:0], df.columns.get_loc('Value')] = 0  # first 11 and last 11 rows; 'Value' is the NaN column in the sample data
df = df.dropna(subset=['Value'])
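If you'd rather not pull in numpy for this, two plain iloc assignments do the same thing (same assumption that the NaN-bearing column is named Value):
df.iloc[:11, df.columns.get_loc('Value')] = 0
df.iloc[-11:, df.columns.get_loc('Value')] = 0
df = df.dropna(subset=['Value'])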

dataframe concatenating with indexing

I have a pandas dataframe that is read from a file.
The next step is to break the dataset into two datasets, df_LastYear and df_ThisYear.
Note that the index is not continuous: 2 and 6 are missing.
ID AdmissionAge
0 14 68
1 22 86
3 78 40
4 124 45
5 128 35
7 148 92
8 183 71
9 185 98
10 219 79
After applying some predictive models, I get the predicted values y_ThisYear:
Prediction
0 2.400000e+01
1 1.400000e+01
2 1.000000e+00
3 2.096032e+09
4 2.000000e+00
5 -7.395179e+11
6 6.159412e+06
7 5.592327e+07
8 5.303477e+08
9 5.500000e+00
10 6.500000e+00
I am trying to concat the two datasets df_ThisYear and y_ThisYear into one dataset, but I always get these results:
ID AdmissionAge Prediction
0 14.0 68.0 2.400000e+01
1 22.0 86.0 1.400000e+01
2 NaN NaN 1.000000e+00
3 78.0 40.0 2.096032e+09
4 124.0 45.0 2.000000e+00
5 128.0 35.0 -7.395179e+11
6 NaN NaN 6.159412e+06
7 148.0 92.0 5.592327e+07
8 183.0 71.0 5.303477e+08
9 185.0 98.0 5.500000e+00
10 219.0 79.0 6.500000e+00
There are NaNs which did not exist before.
I found that these NaNs belong to the indices that are missing from df_ThisYear.
Therefore I tried to reset the index to get continuous indices. I used
df_ThisYear.reset_index(drop=True)
but I still get the same indices.
How do I fix this problem so I can concatenate df_ThisYear with y_ThisYear correctly?
Then you just need join:
df_ThisYear.join(y_ThisYear)
ID AdmissionAge Prediction
0 14 68 2.400000e+01
1 22 86 1.400000e+01
3 78 40 2.096032e+09
4 124 45 2.000000e+00
5 128 35 -7.395179e+11
7 148 92 5.592327e+07
8 183 71 5.303477e+08
9 185 98 5.500000e+00
10 219 79 6.500000e+00
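As an aside, the reason reset_index appeared to do nothing in the question is that it returns a new DataFrame rather than modifying in place, so the result has to be assigned back:
df_ThisYear = df_ThisYear.reset_index(drop=True)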
If you are really excited about using concat, you can pass 'inner' to the join argument:
pd.concat([df_ThisYear, y_ThisYear], axis=1, join='inner')
This returns
Out[6]:
ID AdmissionAge Prediction
0 14 68 2.400000e+01
1 22 86 1.400000e+01
3 78 40 2.096032e+09
4 124 45 2.000000e+00
5 128 35 -7.395179e+11
7 148 92 5.592327e+07
8 183 71 5.303477e+08
9 185 98 5.500000e+00
10 219 79 6.500000e+00
The problem was that y_ThisYear has a different index than df_ThisYear.
When I joined both using
df_ThisYear.join(y_ThisYear)
it matched each number to its matching index.
That is correct when the indices actually represent the same record, i.e. when index 7 in df_ThisYear matches index 7 in y_ThisYear.
In my case I just want to match the first record in y_ThisYear to the first record in df_ThisYear, regardless of their index numbers.
I found this code that does that:
df_ThisYear = pd.concat([df_ThisYear.reset_index(drop=True), pd.DataFrame(y_ThisYear)], axis=1)
Thanks to everyone who helped with the answer.

Pandas Collapse and Stack Multi-level columns

I want to break down the multi-level columns and have them as column values.
Original data input (excel):
As read in dataframe:
Company Name Company code 2017-01-01 00:00:00 Unnamed: 3 Unnamed: 4 Unnamed: 5 2017-02-01 00:00:00 Unnamed: 7 Unnamed: 8 Unnamed: 9 2017-03-01 00:00:00 Unnamed: 11 Unnamed: 12 Unnamed: 13
0 NaN NaN Product A Product B Product C Product D Product A Product B Product C Product D Product A Product B Product C Product D
1 Company A #123 1 5 3 5 0 2 3 4 0 1 2 3
2 Company B #124 600 208 30 20 600 213 30 15 600 232 30 12
3 Company C #125 520 112 47 15 520 110 47 10 520 111 47 15
4 Company D #126 420 165 120 31 420 195 120 30 420 182 120 58
Intended data frame:
I have tried stack() and unstack() and also swaplevel(), but I couldn't get the date columns to "drop down" as rows. It looks like merged cells in Excel produce NaN in the dataframe, and when it's the columns that are merged, I get unnamed columns. How do I work around this? Am I missing something really simple here?
Using stack
df.stack(level=0).reset_index(level=1)
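That one-liner assumes the columns already form a two-level MultiIndex (dates on top, products below). A minimal sketch of getting there from the Excel file, assuming two header rows and a hypothetical filename, is to let read_excel build the MultiIndex and then forward-fill the date level that the merged cells leave as "Unnamed":
import pandas as pd

df = pd.read_excel('sales.xlsx', header=[0, 1], index_col=[0, 1])  # filename is an assumption
dates = [None if str(c).startswith('Unnamed') else c
         for c in df.columns.get_level_values(0)]
df.columns = pd.MultiIndex.from_arrays(
    [pd.Series(dates).ffill(), df.columns.get_level_values(1)],
    names=['Date', 'Product'],
)
tidy = df.stack(level=0).reset_index()  # the dates drop down as a row column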

Iterating over groups in a dataframe [duplicate]

This question already has answers here:
Looping over groups in a grouped dataframe
(2 answers)
Closed 4 years ago.
I want to group the dataframe and then use functions to manipulate the data after it has been grouped. For example, I want to group the data by Date and then iterate through each row in the date groups, passing each to a function.
The issue is that groupby seems to create a tuple of the key and then what looks like one massive string of all the rows, which makes iterating through each row seem impossible.
When you apply groupby to a dataframe, you don't get rows; you get groups, each of which is itself a dataframe. For example, consider:
df
ID Date Days Volume/Day
0 111 2016-01-01 20 50
1 111 2016-02-01 25 40
2 111 2016-03-01 31 35
3 111 2016-04-01 30 30
4 111 2016-05-01 31 25
5 112 2016-01-01 31 55
6 112 2016-01-02 26 45
7 112 2016-01-03 31 40
8 112 2016-01-04 30 35
9 112 2016-01-05 31 30
for i, g in df.groupby('ID'):
    print(g, '\n')
ID Date Days Volume/Day
0 111 2016-01-01 20 50
1 111 2016-02-01 25 40
2 111 2016-03-01 31 35
3 111 2016-04-01 30 30
4 111 2016-05-01 31 25
ID Date Days Volume/Day
5 112 2016-01-01 31 55
6 112 2016-01-02 26 45
7 112 2016-01-03 31 40
8 112 2016-01-04 30 35
9 112 2016-01-05 31 30
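Each g above is an ordinary DataFrame, so if you really do need per-row processing you can iterate the rows inside each group (parse_row here is a hypothetical stand-in for your own function):
for date, group in df.groupby('Date'):
    for idx, row in group.iterrows():
        parse_row(row)  # row is a Series; access fields like row['Volume/Day']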
For your case, you should probably look into GroupBy.apply if you want to apply some function to your groups, GroupBy.transform to produce a like-indexed dataframe (see the docs for an explanation), or GroupBy.agg if you want aggregated results.
You'd do something like:
r = df.groupby('Date').apply(your_function)
You'd define your function as:
def your_function(df):
    ...  # operation on df
    return result
If you have problems with the implementation, please open a new question, post your data and your code, and any associated errors/tracebacks. Happy coding.
