Pandas Merge Rows with Duplicate Ids Conditionally, Suitable for to CSV - python

I have the following df and I want to merge the lines that have the same Ids, unless there are duplicates
Ids A B C D E F G H I J
4411 24 2 55 26 1
4411 24 2 54 26 0
4412 22 4 54 26 0
4412 18 8 54 26 0
7401 12 14 54 26 0
7401 0 25 53 26 0
7402 24 2 54 26 0
7402 25 1 54 26 0
10891 16 10 54 26 0
10891 3 23 54 26 0
10891 5 10 6 15 0
Example output
Ids A B C D E F G H I J
4411 24 2 55 26 1 24 2 54 26 0
4412 22 4 54 26 0 18 8 54 26 0
7401 12 14 54 26 0 0 25 53 26 0
7402 24 2 54 26 0 25 1 54 26 0
10891 16 10 54 26 0 3 23 54 26 0
10891 5 10 6 15 0
I tried groupby but that throws errors when you write to csv.

This solution uses Divakar's justify function. If needed, convert to numeric in advance:
df = df.apply(pd.to_numeric, errors='coerce', axis=1)
Now, call groupby + transform:
df.set_index('Ids')\
.groupby(level=0)\
.transform(
justify, invalid_val=np.nan, axis=0, side='up'
)\
.dropna(how='all')
A B C D E F G H I J
Ids
4411 24.0 2.0 55.0 26.0 1.0 24.0 2.0 54.0 26.0 0.0
4412 22.0 4.0 54.0 26.0 0.0 18.0 8.0 54.0 26.0 0.0
7401 12.0 14.0 54.0 26.0 0.0 0.0 25.0 53.0 26.0 0.0
7402 24.0 2.0 54.0 26.0 0.0 25.0 1.0 54.0 26.0 0.0
10891 16.0 10.0 54.0 26.0 0.0 3.0 23.0 54.0 26.0 0.0
10891 NaN NaN NaN NaN NaN 5.0 10.0 6.0 15.0 0.0
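The justify function itself is not shown in this answer. As a rough stand-in, the sketch below pushes the non-NaN values of each column to the top within every Ids group and then drops the rows that end up empty. The helper name justify_up, the reduced column set (A, B, F, G) and the assumption that blank cells are NaN are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

def justify_up(block):
    """Move the non-NaN values of every column to the top of the block."""
    arr = block.to_numpy(dtype=float)
    out = np.full(arr.shape, np.nan)
    for j in range(arr.shape[1]):
        vals = arr[~np.isnan(arr[:, j]), j]   # non-NaN values, original order kept
        out[:len(vals), j] = vals
    return pd.DataFrame(out, index=block.index, columns=block.columns)

nan = np.nan
# Assumed layout: the first row of each Ids fills A/B, later rows fill F/G.
df = pd.DataFrame({
    "Ids": [4411, 4411, 10891, 10891, 10891],
    "A": [24, nan, 16, nan, nan],
    "B": [2, nan, 10, nan, nan],
    "F": [nan, 24, nan, 3, 5],
    "G": [nan, 2, nan, 23, 10],
})

indexed = df.set_index("Ids")
pieces = [justify_up(g) for _, g in indexed.groupby(level=0, sort=False)]
merged = pd.concat(pieces).dropna(how="all")
print(merged)

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = merged.to_csv()
```

Looping over the groups and concatenating the pieces avoids relying on how groupby.apply reassembles results when the index contains duplicates.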

This will probably be slow, but it achieves what you need:
df.replace('',np.nan).groupby('Ids').apply(lambda x: pd.DataFrame(x).apply(lambda x: sorted(x, key=pd.isnull),0)).dropna(axis=0,thresh=2).fillna('')
Out[539]:
Ids A B C D E F G H I J
0 7402 24.0 2.0 54.0 26.0 0.0 25.0 1.0 54.0 26.0 0.0
2 10891 16.0 10.0 54.0 26.0 0.0 3.0 23.0 54.0 26.0 0.0
3 10891 5.0 10.0 6.0 15.0 0.0

Assuming all the blank values are nan, another option using groupby and dropna:
df.loc[:,'A':'E'] = df.groupby('Ids').apply(lambda x: x.loc[:,'A':'E'].ffill(limit=1))
df.dropna(subset=['F','G','H','I','J'])

Related

Pandas rolling but involves last rows value

I have this dataframe
hour = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]
visitor = [4,6,2,4,3,7,5,7,8,3,2,8,3,6,4,5,1,8,9,4,2,3,4,1]
df = {"Hour":hour, "Total_Visitor":visitor}
df = pd.DataFrame(df)
print(df)
I applied a rolling sum with a window of 6:
df_roll = df.rolling(6, min_periods=6).sum()
print(df_roll)
The first 5 rows give NaN.
The problem is that I want the sum of total visitors from 9pm to 3am, so I have to sum the visitors from hour 21 and then wrap around to hour 0 through 3.
How can I do that automatically with rolling?
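I think you need to prepend the last N rows, apply rolling, and then keep only the last len(df) rows:
N = 6
# DataFrame.append was removed in pandas 2.0, so pd.concat is used to prepend the last N rows
df_roll = pd.concat([df.iloc[-N:], df]).rolling(N).sum().iloc[-len(df):]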
print (df_roll)
Hour Total_Visitor
0 105.0 18.0
1 87.0 20.0
2 69.0 20.0
3 51.0 21.0
4 33.0 20.0
5 15.0 26.0
6 21.0 27.0
7 27.0 28.0
8 33.0 34.0
9 39.0 33.0
10 45.0 32.0
11 51.0 33.0
12 57.0 31.0
13 63.0 30.0
14 69.0 26.0
15 75.0 28.0
16 81.0 27.0
17 87.0 27.0
18 93.0 33.0
19 99.0 31.0
20 105.0 29.0
21 111.0 27.0
22 117.0 30.0
23 123.0 23.0
Compare with the original solution:
df_roll = df.rolling(6, min_periods=6).sum()
print(df_roll)
Hour Total_Visitor
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 15.0 26.0
6 21.0 27.0
7 27.0 28.0
8 33.0 34.0
9 39.0 33.0
10 45.0 32.0
11 51.0 33.0
12 57.0 31.0
13 63.0 30.0
14 69.0 26.0
15 75.0 28.0
16 81.0 27.0
17 87.0 27.0
18 93.0 33.0
19 99.0 31.0
20 105.0 29.0
21 111.0 27.0
22 117.0 30.0
23 123.0 23.0
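The wrap-around idea can be packaged as a small helper. This is only a minimal sketch: the name circular_rolling_sum is made up for illustration, and it assumes the window is no longer than the Series.

```python
import pandas as pd

def circular_rolling_sum(s, window):
    """Rolling sum in which the first rows borrow values from the end of the Series."""
    # Prepend the last `window` values so that every original row has a full window.
    wrapped = pd.concat([s.iloc[-window:], s])
    # Keep only the rows that belong to the original Series.
    return wrapped.rolling(window).sum().iloc[-len(s):]

visitor = [4, 6, 2, 4, 3, 7, 5, 7, 8, 3, 2, 8, 3, 6, 4, 5, 1, 8, 9, 4, 2, 3, 4, 1]
s = pd.Series(visitor, name="Total_Visitor")
result = circular_rolling_sum(s, 6)
print(result.head())
```

Here result[3] covers hours 22, 23, 0, 1, 2 and 3, so no row is NaN.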
A NumPy alternative with strides is more complicated, but faster for one large Series:
def rolling_window(a, window):
    # Build a 2D view in which each row is one window of `a`, without copying the data
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
N = 3
# fv is the input Series; prepend its last N-1 values so the first windows wrap around
x = np.concatenate([fv[-N+1:], fv.to_numpy()])
cv = pd.Series(rolling_window(x, N).sum(axis=1), index=fv.index)
print (cv)
0 5
1 4
2 4
3 6
4 5
dtype: int64
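On NumPy 1.20 or later, sliding_window_view builds the same kind of windowed view without hand-written strides. The sketch below is an alternative to rolling_window, not the answerer's code. The function name cyclic_window_sum and the sample values are assumptions for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def cyclic_window_sum(values, window):
    """Sum over `window` consecutive values, wrapping around at the start."""
    arr = np.asarray(values)
    if window > 1:
        # Prepend the last window-1 values so the first windows wrap around.
        arr = np.concatenate([arr[-(window - 1):], arr])
    return sliding_window_view(arr, window).sum(axis=1)

sums = cyclic_window_sum([1, 2, 3, 4, 5], 3)
print(sums)
```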
Although you mentioned a Series, see if this is helpful:
import pandas as pd
def cyclic_roll(s, n):
    # Series.append was removed in pandas 2.0; pd.concat does the same job
    s = pd.concat([s, s.iloc[:n-1]])
    result = s.rolling(n).sum()
    return pd.concat([result.iloc[-n+1:], result.iloc[n-1:-n+1]])
fv = pd.DataFrame([1, 2, 3, 4, 5])
cv = fv.apply(cyclic_roll, n=3)
cv.reset_index(inplace=True, drop=True)
print(cv)
Output
0
0 10.0
1 8.0
2 6.0
3 9.0
4 12.0

Rearranging a Pandas dataframe

I have this input worksheet:
raw = pd.read_excel("raw.xlsx", header=None)
>Out[4]:
0 1 2 3 4
0 48 59.0 28.0 6.0 36.0
1 41 36.0 52.0 3.0 22.0
2 32 86.0 66.0 68.0 9.0
3 71 23.0 6.0 98.0 19.0
4 18 92.0 66.0 6.0 54.0
5 Andy NaN NaN NaN NaN
6 56 89.0 6.0 32.0 50.0
7 3 68.0 49.0 93.0 15.0
8 27 65.0 94.0 96.0 66.0
9 40 96.0 71.0 22.0 83.0
10 96 23.0 5.0 49.0 14.0
11 Bob NaN NaN NaN NaN
12 43 34.0 42.0 11.0 73.0
13 42 41.0 17.0 91.0 35.0
14 81 74.0 24.0 95.0 95.0
15 89 57.0 35.0 66.0 56.0
16 54 76.0 55.0 72.0 63.0
17 David NaN NaN NaN NaN
18 58 8.0 62.0 63.0 8.0
19 15 93.0 97.0 38.0 5.0
20 13 96.0 42.0 51.0 48.0
21 23 88.0 20.0 91.0 39.0
22 9 67.0 45.0 58.0 92.0
23 Bill NaN NaN NaN NaN
24 2 3.0 80.0 28.0 38.0
25 100 68.0 83.0 26.0 45.0
26 79 57.0 40.0 76.0 83.0
27 12 98.0 76.0 63.0 53.0
28 60 88.0 70.0 13.0 50.0
29 Luke NaN NaN NaN NaN
I have to rearrange each "block" of data onto a single line, followed by the name.
The format is fixed, and the output should look like this
>Out[6]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
0 48 59 28 6 36 41 36 52 3 22 32 86 66 68 9 71 23 6 98 19 18 92 66 6 54 Andy
1 56 89 6 32 50 3 68 49 93 15 27 65 94 96 66 40 96 71 22 83 96 23 5 49 14 Bob
2 43 34 42 11 73 42 41 17 91 35 81 74 24 95 95 89 57 35 66 56 54 76 55 72 63 David
3 58 8 62 63 8 15 93 97 38 5 13 96 42 51 48 23 88 20 91 39 9 67 45 58 92 Bill
4 2 3 80 28 38 100 68 83 26 45 79 57 40 76 83 12 98 76 63 53 60 88 70 13 50 Luke
Which is the most pythonic way to do this?
Thanks
The first part of the code just reads the data in from string lines for the sake of the example; you should use your pd.read_excel() call instead.
The actual conversion happens in the next few lines:
a = df.values
a = a.reshape([a.size // 30, 30])
a = a[:, :-4]
df = pd.DataFrame(a)
This above can be even shortened to one-liner:
df = pd.DataFrame(df.values.reshape((-1, 30))[:, :-4])
Full code down below:
import pandas as pd, numpy as np
# Next is just reading-in my data in a fancy way,
# you do pd.read_excel(file) instead like you did before
df = pd.DataFrame([line.split() for line in """
48 59.0 28.0 6.0 36.0
41 36.0 52.0 3.0 22.0
32 86.0 66.0 68.0 9.0
71 23.0 6.0 98.0 19.0
18 92.0 66.0 6.0 54.0
Andy NaN NaN NaN NaN
56 89.0 6.0 32.0 50.0
3 68.0 49.0 93.0 15.0
27 65.0 94.0 96.0 66.0
40 96.0 71.0 22.0 83.0
96 23.0 5.0 49.0 14.0
Bob NaN NaN NaN NaN
43 34.0 42.0 11.0 73.0
42 41.0 17.0 91.0 35.0
81 74.0 24.0 95.0 95.0
89 57.0 35.0 66.0 56.0
54 76.0 55.0 72.0 63.0
David NaN NaN NaN NaN
58 8.0 62.0 63.0 8.0
15 93.0 97.0 38.0 5.0
13 96.0 42.0 51.0 48.0
23 88.0 20.0 91.0 39.0
9 67.0 45.0 58.0 92.0
Bill NaN NaN NaN NaN
2 3.0 80.0 28.0 38.0
100 68.0 83.0 26.0 45.0
79 57.0 40.0 76.0 83.0
12 98.0 76.0 63.0 53.0
60 88.0 70.0 13.0 50.0
Luke NaN NaN NaN NaN
""".splitlines() if line.strip()])
a = df.values
a = a.reshape([a.size // 30, 30])
a = a[:, :-4]
df = pd.DataFrame(a)
print(df)
Outputs:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
0 48 59.0 28.0 6.0 36.0 41 36.0 52.0 3.0 22.0 32 86.0 66.0 68.0 9.0 71 23.0 6.0 98.0 19.0 18 92.0 66.0 6.0 54.0 Andy
1 56 89.0 6.0 32.0 50.0 3 68.0 49.0 93.0 15.0 27 65.0 94.0 96.0 66.0 40 96.0 71.0 22.0 83.0 96 23.0 5.0 49.0 14.0 Bob
2 43 34.0 42.0 11.0 73.0 42 41.0 17.0 91.0 35.0 81 74.0 24.0 95.0 95.0 89 57.0 35.0 66.0 56.0 54 76.0 55.0 72.0 63.0 David
3 58 8.0 62.0 63.0 8.0 15 93.0 97.0 38.0 5.0 13 96.0 42.0 51.0 48.0 23 88.0 20.0 91.0 39.0 9 67.0 45.0 58.0 92.0 Bill
4 2 3.0 80.0 28.0 38.0 100 68.0 83.0 26.0 45.0 79 57.0 40.0 76.0 83.0 12 98.0 76.0 63.0 53.0 60 88.0 70.0 13.0 50.0 Luke
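The reshape above hard-codes the block size (30 values per block). If the number of data rows per name could vary, one possible variation is to detect the name rows instead. The sketch below assumes that a name row is any row whose columns after the first are all NaN and that every block ends with such a row. The helper name flatten_blocks and the tiny sample frame are illustrative only.

```python
import numpy as np
import pandas as pd

def flatten_blocks(raw):
    """Turn each block of data rows plus its trailing name row into one output row."""
    # Assumption: a name row has NaN in every column after the first.
    is_name = raw.iloc[:, 1:].isna().all(axis=1)
    # A new block starts right after every name row.
    block_id = is_name.shift(fill_value=False).cumsum()
    rows = []
    for _, block in raw.groupby(block_id):
        data = block.iloc[:-1]     # the data rows of the block
        name = block.iloc[-1, 0]   # the name is the first cell of the last row
        rows.append(list(data.to_numpy().ravel()) + [name])
    return pd.DataFrame(rows)

raw = pd.DataFrame([
    [1, 2.0, 3.0],
    [4, 5.0, 6.0],
    ["Andy", np.nan, np.nan],
    [7, 8.0, 9.0],
    [10, 11.0, 12.0],
    ["Bob", np.nan, np.nan],
])
flat = flatten_blocks(raw)
print(flat)
```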

Python add columns to DataFrame with rolling window based on 3 previous rows

I have a dataframe like this:
original = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=["P1_day", "P1_week", "P1_month"])
print(original)
P1_day P1_week P1_month
0 50 17 55
1 45 3 10
2 93 79 84
3 99 38 33
4 44 35 35
5 25 43 87
6 38 88 56
7 20 66 6
8 4 23 6
9 39 75 3
I need to generate a new dataframe that starts from the row with index 3 of the original dataframe and adds 9 new columns based on a rolling window defined as the 3 previous rows, with the corresponding suffixes: [_0, _1, _2]. So for the row with index 3, these are the rows with index [0, 1, 2] from the original dataframe.
For example, the next 3 columns will be from the original.iloc[0],
and after the next 3 columns will be from the original.iloc[1],
and the last 3 columns will be from the original.iloc[2]
I tried to solve it with the following code:
subset_shifted = original[["P1_day", "P1_week", "P1_month"]].shift(3)
subset_shifted.columns = ["P1_day_0", "P1_week_0", "P1_month_0"]
original_ = pd.concat([original, subset_shifted], axis = 1)
print(original_)
As a result, I have 3 additional columns holding the values of the _0 row (three rows earlier):
P1_day P1_week P1_month P1_day_0 P1_week_0 P1_month_0
0 50 17 55 NaN NaN NaN
1 45 3 10 NaN NaN NaN
2 93 79 84 NaN NaN NaN
3 99 38 33 50.0 17.0 55.0
4 44 35 35 45.0 3.0 10.0
5 25 43 87 93.0 79.0 84.0
6 38 88 56 99.0 38.0 33.0
7 20 66 6 44.0 35.0 35.0
8 4 23 6 25.0 43.0 87.0
9 39 75 3 38.0 88.0 56.0
In the next iteration I used shift(2) with the same approach and got the columns from original.iloc[1].
In the last iteration I used shift(1) and got the expected result:
result = original_.iloc[3:]
P1_day P1_week P1_month P1_day_0 P1_week_0 P1_month_0 P1_day_1 P1_week_1 P1_month_1 P1_day_2 P1_week_2 P1_month_2
3 99 38 33 50.0 17.0 55.0 45.0 3.0 10.0 93.0 79.0 84.0
4 44 35 35 45.0 3.0 10.0 93.0 79.0 84.0 99.0 38.0 33.0
5 25 43 87 93.0 79.0 84.0 99.0 38.0 33.0 44.0 35.0 35.0
6 38 88 56 99.0 38.0 33.0 44.0 35.0 35.0 25.0 43.0 87.0
7 20 66 6 44.0 35.0 35.0 25.0 43.0 87.0 38.0 88.0 56.0
8 4 23 6 25.0 43.0 87.0 38.0 88.0 56.0 20.0 66.0 6.0
9 39 75 3 38.0 88.0 56.0 20.0 66.0 6.0 4.0 23.0 6.0
Question:
Is there a better approach to this task than the one I described? Thanks.
Unless you want all these extra DataFrames, you could just add the new columns to your original df directly:
import pandas as pd
import numpy as np
original = pd.DataFrame(
np.random.randint(0,100,size=(10, 3)),
columns=["P1_day", "P1_week", "P1_month"],
)
original[
["P1_day_0", "P1_week_0", "P1_month_0"]
] = original[
["P1_day", "P1_week", "P1_month"]
].shift(3)
print(original)
output:
P1_day P1_week P1_month P1_day_0 P1_week_0 P1_month_0
0 2 35 26 NaN NaN NaN
1 99 4 96 NaN NaN NaN
2 4 67 6 NaN NaN NaN
3 76 33 31 2.0 35.0 26.0
4 84 60 98 99.0 4.0 96.0
5 57 1 58 4.0 67.0 6.0
6 35 70 96 76.0 33.0 31.0
7 81 32 39 84.0 60.0 98.0
8 25 4 38 57.0 1.0 58.0
9 83 4 60 35.0 70.0 96.0
Edit: OP asked the follow up question:
yes, for the first row it makes sense. But, my task is to add first 3 rows with index 0-1-2 as new 9 columns for the respected rows started from 3rd index. In your output row with index 1st is not added to the 3rd row as 3 columns. In my code that's why I used shift(2) and shift(1) iteratively.
Here is how this could be done iteratively:
import pandas as pd
import numpy as np
original = pd.DataFrame(
np.random.randint(0,100,size=(10, 3)),
columns=["P1_day", "P1_week", "P1_month"],
)
for shift, n in ((3,0),(2,1),(1,2)):
    original[
        [f"P1_day_{n}", f"P1_week_{n}", f"P1_month_{n}"]
    ] = original[
        ["P1_day", "P1_week", "P1_month"]
    ].shift(shift)
pd.set_option('display.max_columns', None)
print(original.iloc[3:])
Output:
P1_day P1_week P1_month P1_day_0 P1_week_0 P1_month_0 P1_day_1 \
3 58 43 74 26.0 56.0 82.0 56.0
4 44 27 40 56.0 87.0 38.0 31.0
5 2 90 4 31.0 32.0 87.0 58.0
6 90 70 6 58.0 43.0 74.0 44.0
7 1 31 57 44.0 27.0 40.0 2.0
8 96 22 69 2.0 90.0 4.0 90.0
9 13 98 47 90.0 70.0 6.0 1.0
P1_week_1 P1_month_1 P1_day_2 P1_week_2 P1_month_2
3 87.0 38.0 31.0 32.0 87.0
4 32.0 87.0 58.0 43.0 74.0
5 43.0 74.0 44.0 27.0 40.0
6 27.0 40.0 2.0 90.0 4.0
7 90.0 4.0 90.0 70.0 6.0
8 70.0 6.0 1.0 31.0 57.0
9 31.0 57.0 96.0 22.0 69.0
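The iterative shifting can also be written for any number of lags. The sketch below is one way to do it, assuming the suffix convention from the question (_0 is the oldest row). The function name add_lagged_columns is hypothetical, and the data is copied from the question so the result can be checked against its expected table.

```python
import pandas as pd

def add_lagged_columns(df, n_lags):
    """Append the n_lags previous rows as extra columns with suffixes _0 .. _{n_lags-1}."""
    # Suffix _0 is the oldest row (shift by n_lags), the last suffix is the previous row.
    lagged = [df.shift(n_lags - k).add_suffix(f"_{k}") for k in range(n_lags)]
    # The first n_lags rows have incomplete history, so they are dropped.
    return pd.concat([df, *lagged], axis=1).iloc[n_lags:]

original = pd.DataFrame({
    "P1_day": [50, 45, 93, 99, 44, 25, 38, 20, 4, 39],
    "P1_week": [17, 3, 79, 38, 35, 43, 88, 66, 23, 75],
    "P1_month": [55, 10, 84, 33, 35, 87, 56, 6, 6, 3],
})
result = add_lagged_columns(original, 3)
print(result)
```

Building all shifted frames first and concatenating once also leaves the original frame unmodified.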
Edit 2: Not to make any assumptions here, but if your end goal is to get something like the 4 period moving average from the data in all of these new columns then you might not need them at all. You can use pandas.DataFrame.rolling instead:
import pandas as pd
import numpy as np
original = pd.DataFrame(
np.random.randint(0,100,size=(10, 3)),
columns=["P1_day", "P1_week", "P1_month"],
)
original[
["P1_day_4PMA", "P1_week_4PMA", "P1_month_4PMA"]
] = original[
["P1_day", "P1_week", "P1_month"]
].rolling(4).mean()
pd.set_option('display.max_columns', None)
print(original.iloc[3:])
Output:
P1_day P1_week P1_month P1_day_4PMA P1_week_4PMA P1_month_4PMA
3 1 13 48 31.25 38.00 55.00
4 10 4 40 22.00 21.00 45.75
5 7 76 0 5.50 23.75 37.00
6 5 69 9 5.75 40.50 24.25
7 63 31 82 21.25 45.00 32.75
8 26 67 22 25.25 60.75 28.25
9 89 41 40 45.75 52.00 38.25

Cumsum of fixed amount of elements without loops

I have Python code that computes the cumulative sum of 14 elements in a column, starting from different elements, and writes this sum to another column. Does anyone know how to do it without loops?
import pandas as pd
import numpy as np
a = pd.DataFrame({"A": [i for i in range(25)]})
b = pd.DataFrame({"B": [np.nan for i in range(25)]})
for i in range(4, len(b)):
    cumsum = 0
    for k in range(i - 4, i):
        cumsum += a.A[k]
    # after the inner loop k == i - 1, so the window sum is stored at its last row
    b.B[k] = cumsum
pd.concat([a,b], axis=1)
IIUC you are looking for rolling(4) + sum():
In [83]: a['new'] = a.A.rolling(4).sum()
In [84]: a
Out[84]:
A new
0 0 NaN
1 1 NaN
2 2 NaN
3 3 6.0
4 4 10.0
5 5 14.0
6 6 18.0
7 7 22.0
8 8 26.0
9 9 30.0
10 10 34.0
11 11 38.0
12 12 42.0
13 13 46.0
14 14 50.0
15 15 54.0
16 16 58.0
17 17 62.0
18 18 66.0
19 19 70.0
20 20 74.0
21 21 78.0
22 22 82.0
23 23 86.0
24 24 90.0
check:
In [86]: pd.concat([a,b], axis=1)
Out[86]:
A new B
0 0 NaN NaN
1 1 NaN NaN
2 2 NaN NaN
3 3 6.0 6.0
4 4 10.0 10.0
5 5 14.0 14.0
6 6 18.0 18.0
7 7 22.0 22.0
8 8 26.0 26.0
9 9 30.0 30.0
10 10 34.0 34.0
11 11 38.0 38.0
12 12 42.0 42.0
13 13 46.0 46.0
14 14 50.0 50.0
15 15 54.0 54.0
16 16 58.0 58.0
17 17 62.0 62.0
18 18 66.0 66.0
19 19 70.0 70.0
20 20 74.0 74.0
21 21 78.0 78.0
22 22 82.0 82.0
23 23 86.0 86.0
24 24 90.0 NaN
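Since the question is phrased in terms of cumsum, it may help to see that a fixed-window sum is just the difference of two cumulative sums. The sketch below compares that formulation with rolling(4).sum() on the same data. The variable names are illustrative and this is not part of the original answer.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"A": range(25)})
window = 4

# Rolling sum as used in the answer.
rolled = a["A"].rolling(window).sum()

# Same result from cumulative sums: sum(A[i-3..i]) == cumsum[i] - cumsum[i-4].
c = a["A"].cumsum()
diff = c - c.shift(window, fill_value=0)
# Mask the first window-1 rows, which do not have a full window yet.
by_cumsum = diff.where(np.arange(len(c)) >= window - 1)

print(pd.concat([rolled, by_cumsum], axis=1, keys=["rolling", "cumsum_diff"]).head(6))
```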

pandas adding new column?

I'm learning pandas and trying things out. However, I got stuck adding a new column, because the new columns have a larger index. I would like to add more than 3 columns.
Here is my code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
example=pd.read_excel("C:/Users/ömer sarı/AppData/Local/Programs/Python/Python35-32/data_analysis/pydata-book-master/example.xlsx",names=["a","b","c","d","e","f","g"])
dropNan=example.dropna()
#print(dropNan)
Fillup=example.fillna(-99)
#print(Fillup)
Countt=Fillup.get_dtype_counts()
#print(Countt)
date=pd.date_range("2014-01-01","2014-01-15",freq="D")
#print(date)
mm=date.month
yy=date.year
dd=date.day
df=pd.DataFrame(example)
print(df)
x=[i for i in yy]
print(x)
df["year"]=df[x]
Here is the example dataset:
a b c d e f g
0 1 1.0 2.0 5 3 11.0 57.0
1 2 4.0 6.0 10 6 22.0 59.0
2 3 9.0 12.0 15 9 33.0 61.0
3 4 16.0 20.0 20 12 44.0 63.0
4 5 25.0 NaN 25 15 NaN 65.0
5 6 NaN 42.0 30 18 66.0 NaN
6 7 49.0 56.0 35 21 77.0 69.0
7 8 64.0 72.0 40 24 88.0 71.0
8 9 81.0 NaN 45 27 99.0 73.0
9 10 NaN 110.0 50 30 NaN 75.0
10 11 121.0 NaN 55 33 121.0 77.0
11 12 144.0 156.0 60 36 132.0 NaN
12 13 169.0 182.0 65 39 143.0 81.0
13 14 196.0 NaN 70 42 154.0 83.0
14 15 225.0 240.0 75 45 165.0 85.0
Here is the error message:
IndexError: indices are out-of-bounds
After that, I tried this and got a new error:
df=pd.DataFrame(range(len(x)),index=x, columns=["a","b","c","d","e","f","g"])
pandas.core.common.PandasError: DataFrame constructor not properly called!
It is just a trial to learn. How can I add the date, split into parts, as new columns like ["date", "year", "month", "day", ...]?
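No answer is recorded for this question. A likely cause of the IndexError is that df[x] asks pandas to select columns labelled with the years rather than to assign them as values. The sketch below shows one way to attach the date and its parts, assuming the date range has exactly as many entries as the frame has rows. The one-column frame stands in for the Excel data.

```python
import pandas as pd

# Stand-in for the 15-row example dataset (only column "a" is reproduced here).
df = pd.DataFrame({"a": range(1, 16)})

date = pd.date_range("2014-01-01", "2014-01-15", freq="D")

# Assign the values directly; the lengths must match the number of rows.
df["date"] = date
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
print(df.head())
```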
