I have this input worksheet:
raw = pd.read_excel("raw.xlsx", header=None)
Out[4]:
0 1 2 3 4
0 48 59.0 28.0 6.0 36.0
1 41 36.0 52.0 3.0 22.0
2 32 86.0 66.0 68.0 9.0
3 71 23.0 6.0 98.0 19.0
4 18 92.0 66.0 6.0 54.0
5 Andy NaN NaN NaN NaN
6 56 89.0 6.0 32.0 50.0
7 3 68.0 49.0 93.0 15.0
8 27 65.0 94.0 96.0 66.0
9 40 96.0 71.0 22.0 83.0
10 96 23.0 5.0 49.0 14.0
11 Bob NaN NaN NaN NaN
12 43 34.0 42.0 11.0 73.0
13 42 41.0 17.0 91.0 35.0
14 81 74.0 24.0 95.0 95.0
15 89 57.0 35.0 66.0 56.0
16 54 76.0 55.0 72.0 63.0
17 David NaN NaN NaN NaN
18 58 8.0 62.0 63.0 8.0
19 15 93.0 97.0 38.0 5.0
20 13 96.0 42.0 51.0 48.0
21 23 88.0 20.0 91.0 39.0
22 9 67.0 45.0 58.0 92.0
23 Bill NaN NaN NaN NaN
24 2 3.0 80.0 28.0 38.0
25 100 68.0 83.0 26.0 45.0
26 79 57.0 40.0 76.0 83.0
27 12 98.0 76.0 63.0 53.0
28 60 88.0 70.0 13.0 50.0
29 Luke NaN NaN NaN NaN
I have to rearrange each "block" of data onto a single line, followed by the name.
The format is fixed, and the output should look like this:
Out[6]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
0 48 59 28 6 36 41 36 52 3 22 32 86 66 68 9 71 23 6 98 19 18 92 66 6 54 Andy
1 56 89 6 32 50 3 68 49 93 15 27 65 94 96 66 40 96 71 22 83 96 23 5 49 14 Bob
2 43 34 42 11 73 42 41 17 91 35 81 74 24 95 95 89 57 35 66 56 54 76 55 72 63 David
3 58 8 62 63 8 15 93 97 38 5 13 96 42 51 48 23 88 20 91 39 9 67 45 58 92 Bill
4 2 3 80 28 38 100 68 83 26 45 79 57 40 76 83 12 98 76 63 53 60 88 70 13 50 Luke
What is the most pythonic way to do this?
Thanks
The first part of the code below is just many lines that read in the data from string lines; that is only for the example, and you would use your pd.read_excel() call instead. The real part is the conversion, which takes the next few lines:
a = df.values                      # underlying NumPy array, shape (30, 5)
a = a.reshape([a.size // 30, 30])  # each 6-row block of 5 values becomes one row of 30
a = a[:, :-4]                      # drop the 4 trailing NaNs that follow the name
df = pd.DataFrame(a)
The above can even be shortened to a one-liner:
df = pd.DataFrame(df.values.reshape((-1, 30))[:, :-4])
Full code below:
import pandas as pd, numpy as np
# The next lines just read in my data in a fancy way;
# you would do pd.read_excel(file) instead, like you did before
df = pd.DataFrame([line.split() for line in """
48 59.0 28.0 6.0 36.0
41 36.0 52.0 3.0 22.0
32 86.0 66.0 68.0 9.0
71 23.0 6.0 98.0 19.0
18 92.0 66.0 6.0 54.0
Andy NaN NaN NaN NaN
56 89.0 6.0 32.0 50.0
3 68.0 49.0 93.0 15.0
27 65.0 94.0 96.0 66.0
40 96.0 71.0 22.0 83.0
96 23.0 5.0 49.0 14.0
Bob NaN NaN NaN NaN
43 34.0 42.0 11.0 73.0
42 41.0 17.0 91.0 35.0
81 74.0 24.0 95.0 95.0
89 57.0 35.0 66.0 56.0
54 76.0 55.0 72.0 63.0
David NaN NaN NaN NaN
58 8.0 62.0 63.0 8.0
15 93.0 97.0 38.0 5.0
13 96.0 42.0 51.0 48.0
23 88.0 20.0 91.0 39.0
9 67.0 45.0 58.0 92.0
Bill NaN NaN NaN NaN
2 3.0 80.0 28.0 38.0
100 68.0 83.0 26.0 45.0
79 57.0 40.0 76.0 83.0
12 98.0 76.0 63.0 53.0
60 88.0 70.0 13.0 50.0
Luke NaN NaN NaN NaN
""".splitlines() if line.strip()])
a = df.values
a = a.reshape([a.size // 30, 30])
a = a[:, :-4]
df = pd.DataFrame(a)
print(df)
Outputs:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
0 48 59.0 28.0 6.0 36.0 41 36.0 52.0 3.0 22.0 32 86.0 66.0 68.0 9.0 71 23.0 6.0 98.0 19.0 18 92.0 66.0 6.0 54.0 Andy
1 56 89.0 6.0 32.0 50.0 3 68.0 49.0 93.0 15.0 27 65.0 94.0 96.0 66.0 40 96.0 71.0 22.0 83.0 96 23.0 5.0 49.0 14.0 Bob
2 43 34.0 42.0 11.0 73.0 42 41.0 17.0 91.0 35.0 81 74.0 24.0 95.0 95.0 89 57.0 35.0 66.0 56.0 54 76.0 55.0 72.0 63.0 David
3 58 8.0 62.0 63.0 8.0 15 93.0 97.0 38.0 5.0 13 96.0 42.0 51.0 48.0 23 88.0 20.0 91.0 39.0 9 67.0 45.0 58.0 92.0 Bill
4 2 3.0 80.0 28.0 38.0 100 68.0 83.0 26.0 45.0 79 57.0 40.0 76.0 83.0 12 98.0 76.0 63.0 53.0 60 88.0 70.0 13.0 50.0 Luke
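One caveat: after the reshape every column has object dtype, because numbers and names travel through the same NumPy array. If you need the numeric columns back as real numbers, you could convert them afterwards; a minimal sketch, assuming the name always lands in the last column:
# Convert all but the last column back to numeric dtype; the last column keeps the names
df[df.columns[:-1]] = df[df.columns[:-1]].apply(pd.to_numeric)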
This is my Jupyter code written for VerticaPy - https://www.vertica.com/python/documentation_last/vdataframe/statistics.php
I have created a vDataFrame for a table and want to find the average of a column:
import vertica_python
from verticapy import vDataFrame  # needed for vDataFrame below (assumed import path)
conn_info = {'host': '127.0.0.1',
'port': 5433,
'user': 'dbadmin',
'password': '',
'database': 'kaggle_titanic'}
vdf_table = vDataFrame("titanic_train_flex_view")
mean = vdf_table["age"].mean
mean
I expect to see a single mean value, but I see the table itself. Why do I not see just one value?
<bound method vColumn.avg of age
1 22.0
2 38.0
3 26.0
4 35.0
5 35.0
6 None
7 54.0
8 2.0
9 27.0
10 14.0
... ...
Rows: 1-100 of 891 | Column: age | Type: Numeric(6,3)>
vColumn.mean is a method, unlike properties such as DataFrame.shape; accessing it without parentheses returns the bound method itself (the repr above even shows it is an alias of vColumn.avg). You need to call methods using parentheses, e.g. vdf_table["age"].mean()
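A minimal sketch of the fix (the printed number is illustrative, not computed from your data):
mean = vdf_table["age"].mean()  # note the parentheses: this actually executes the aggregation
print(mean)                     # a single scalar, e.g. 29.699...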
I'm trying to find out how to get the sum of all columns, or of a specific column.
I found a way, and here is that part of the code:
data.loc['total'] = data.select_dtypes(np.number).sum()
This works correctly, but I get this warning:
C:\Users\AAR\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\indexing.py:671: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_with_indexer(indexer, value)
On the attached image you can see what I want to get: the table with a 'total' row appended.
If you just want to append a sum row, you can do this:
Dataframe:
0 1 2 3 4 5 6 7 8 9
0 12 32 45 67 89 54 23.0 56.0 78.0 98.0
1 34 76 34 89 34 3 NaN NaN NaN NaN
2 76 34 54 12 43 78 56.0 NaN NaN NaN
3 76 56 45 23 43 45 67.0 76.0 67.0 8.0
4 87 9 9 0 89 90 6.0 89.0 NaN NaN
5 23 90 90 32 23 34 56.0 9.0 56.0 87.0
6 23 56 34 3 5 8 7.0 6.0 98.0 NaN
7 32 23 34 6 65 78 67.0 87.0 89.0 87.0
8 12 23 34 32 43 67 45.0 NaN NaN NaN
9 343 76 56 7 8 9 4.0 5.0 8.0 68.0
df = df.append(df.sum(numeric_only=True), ignore_index=True)
Output:
0 1 2 3 4 5 6 7 8 9
0 12.0 32.0 45.0 67.0 89.0 54.0 23.0 56.0 78.0 98.0
1 34.0 76.0 34.0 89.0 34.0 3.0 NaN NaN NaN NaN
2 76.0 34.0 54.0 12.0 43.0 78.0 56.0 NaN NaN NaN
3 76.0 56.0 45.0 23.0 43.0 45.0 67.0 76.0 67.0 8.0
4 87.0 9.0 9.0 0.0 89.0 90.0 6.0 89.0 NaN NaN
5 23.0 90.0 90.0 32.0 23.0 34.0 56.0 9.0 56.0 87.0
6 23.0 56.0 34.0 3.0 5.0 8.0 7.0 6.0 98.0 NaN
7 32.0 23.0 34.0 6.0 65.0 78.0 67.0 87.0 89.0 87.0
8 12.0 23.0 34.0 32.0 43.0 67.0 45.0 NaN NaN NaN
9 343.0 76.0 56.0 7.0 8.0 9.0 4.0 5.0 8.0 68.0
10 718.0 475.0 435.0 271.0 442.0 466.0 331.0 328.0 396.0 348.0
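Two notes on this. First, the SettingWithCopyWarning in the question usually means data was itself created as a slice of another DataFrame; taking an explicit copy first (data = data.copy()) before assigning the total row makes it go away. Second, DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on newer versions the same idea can be written with pd.concat, roughly:
# pandas >= 2.0 equivalent: build the column sums as a one-row frame and concatenate
total_row = df.sum(numeric_only=True).to_frame().T
df = pd.concat([df, total_row], ignore_index=True)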
I have a dataframe like this:
original = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=["P1_day", "P1_week", "P1_month"])
print(original)
P1_day P1_week P1_month
0 50 17 55
1 45 3 10
2 93 79 84
3 99 38 33
4 44 35 35
5 25 43 87
6 38 88 56
7 20 66 6
8 4 23 6
9 39 75 3
I need to generate a new dataframe starting from the 3rd row of the original dataframe, adding 9 new columns built from a rolling window of the 3 previous rows, with the corresponding suffixes [_0, _1, _2]. So for row 3, those are the rows with index [0, 1, 2] of the original dataframe.
For example, the next 3 columns will be from original.iloc[0],
the following 3 columns will be from original.iloc[1],
and the last 3 columns will be from original.iloc[2].
I tried to solve it with the following code:
subset_shifted = original[["P1_day", "P1_week", "P1_month"]].shift(3)
subset_shifted.columns = ["P1_day_0", "P1_week_0", "P1_month_0"]
original_ = pd.concat([original, subset_shifted], axis = 1)
print(original_)
As a result, I have 3 additional columns with the values from the row 3 positions earlier:
P1_day P1_week P1_month P1_day_0 P1_week_0 P1_month_0
0 50 17 55 NaN NaN NaN
1 45 3 10 NaN NaN NaN
2 93 79 84 NaN NaN NaN
3 99 38 33 50.0 17.0 55.0
4 44 35 35 45.0 3.0 10.0
5 25 43 87 93.0 79.0 84.0
6 38 88 56 99.0 38.0 33.0
7 20 66 6 44.0 35.0 35.0
8 4 23 6 25.0 43.0 87.0
9 39 75 3 38.0 88.0 56.0
In the next iteration I did shift(2) with the same approach and received the columns from original.iloc[1].
On the last iteration I did shift(1) and got the expected result:
result = original_.iloc[3:]
P1_day P1_week P1_month P1_day_0 P1_week_0 P1_month_0 P1_day_1 P1_week_1 P1_month_1 P1_day_2 P1_week_2 P1_month_2
3 99 38 33 50.0 17.0 55.0 45.0 3.0 10.0 93.0 79.0 84.0
4 44 35 35 45.0 3.0 10.0 93.0 79.0 84.0 99.0 38.0 33.0
5 25 43 87 93.0 79.0 84.0 99.0 38.0 33.0 44.0 35.0 35.0
6 38 88 56 99.0 38.0 33.0 44.0 35.0 35.0 25.0 43.0 87.0
7 20 66 6 44.0 35.0 35.0 25.0 43.0 87.0 38.0 88.0 56.0
8 4 23 6 25.0 43.0 87.0 38.0 88.0 56.0 20.0 66.0 6.0
9 39 75 3 38.0 88.0 56.0 20.0 66.0 6.0 4.0 23.0 6.0
Question:
Is there a better approach to this task than the one I described? Thanks.
Unless you want all these extra DataFrames, you could just add the new columns to your original df directly:
import pandas as pd
import numpy as np
original = pd.DataFrame(
np.random.randint(0,100,size=(10, 3)),
columns=["P1_day", "P1_week", "P1_month"],
)
original[
["P1_day_0", "P1_week_0", "P1_month_0"]
] = original[
["P1_day", "P1_week", "P1_month"]
].shift(3)
print(original)
output:
P1_day P1_week P1_month P1_day_0 P1_week_0 P1_month_0
0 2 35 26 NaN NaN NaN
1 99 4 96 NaN NaN NaN
2 4 67 6 NaN NaN NaN
3 76 33 31 2.0 35.0 26.0
4 84 60 98 99.0 4.0 96.0
5 57 1 58 4.0 67.0 6.0
6 35 70 96 76.0 33.0 31.0
7 81 32 39 84.0 60.0 98.0
8 25 4 38 57.0 1.0 58.0
9 83 4 60 35.0 70.0 96.0
Edit: OP asked the follow-up question:
"Yes, for the first row it makes sense. But my task is to add the first 3 rows (index 0-1-2) as 9 new columns to the respective rows starting from the 3rd index. In your output the row with index 1 is not added to the 3rd row as 3 columns. That's why in my code I used shift(2) and shift(1) iteratively."
Here is how this could be done iteratively:
import pandas as pd
import numpy as np
original = pd.DataFrame(
np.random.randint(0,100,size=(10, 3)),
columns=["P1_day", "P1_week", "P1_month"],
)
for shift, n in ((3,0),(2,1),(1,2)):
original[
[f"P1_day_{n}", f"P1_week_{n}", f"P1_month_{n}"]
] = original[
["P1_day", "P1_week", "P1_month"]
].shift(shift)
pd.set_option('display.max_columns', None)
print(original.iloc[3:])
Output:
P1_day P1_week P1_month P1_day_0 P1_week_0 P1_month_0 P1_day_1 \
3 58 43 74 26.0 56.0 82.0 56.0
4 44 27 40 56.0 87.0 38.0 31.0
5 2 90 4 31.0 32.0 87.0 58.0
6 90 70 6 58.0 43.0 74.0 44.0
7 1 31 57 44.0 27.0 40.0 2.0
8 96 22 69 2.0 90.0 4.0 90.0
9 13 98 47 90.0 70.0 6.0 1.0
P1_week_1 P1_month_1 P1_day_2 P1_week_2 P1_month_2
3 87.0 38.0 31.0 32.0 87.0
4 32.0 87.0 58.0 43.0 74.0
5 43.0 74.0 44.0 27.0 40.0
6 27.0 40.0 2.0 90.0 4.0
7 90.0 4.0 90.0 70.0 6.0
8 70.0 6.0 1.0 31.0 57.0
9 31.0 57.0 96.0 22.0 69.0
Edit 2: Not to make any assumptions here, but if your end goal is to get something like the 4 period moving average from the data in all of these new columns then you might not need them at all. You can use pandas.DataFrame.rolling instead:
import pandas as pd
import numpy as np
original = pd.DataFrame(
np.random.randint(0,100,size=(10, 3)),
columns=["P1_day", "P1_week", "P1_month"],
)
original[
["P1_day_4PMA", "P1_week_4PMA", "P1_month_4PMA"]
] = original[
["P1_day", "P1_week", "P1_month"]
].rolling(4).mean()
pd.set_option('display.max_columns', None)
print(original.iloc[3:])
Output:
P1_day P1_week P1_month P1_day_4PMA P1_week_4PMA P1_month_4PMA
3 1 13 48 31.25 38.00 55.00
4 10 4 40 22.00 21.00 45.75
5 7 76 0 5.50 23.75 37.00
6 5 69 9 5.75 40.50 24.25
7 63 31 82 21.25 45.00 32.75
8 26 67 22 25.25 60.75 28.25
9 89 41 40 45.75 52.00 38.25
I have the following df and I want to merge the lines that have the same Ids, unless there are duplicates:
  Ids    A    B    C    D    E    F    G    H    I    J
 4411   24    2   55   26    1
 4411                            24    2   54   26    0
 4412   22    4   54   26    0
 4412                            18    8   54   26    0
 7401   12   14   54   26    0
 7401                             0   25   53   26    0
 7402   24    2   54   26    0
 7402                            25    1   54   26    0
10891   16   10   54   26    0
10891                             3   23   54   26    0
10891                             5   10    6   15    0
Example output
  Ids    A    B    C    D    E    F    G    H    I    J
 4411   24    2   55   26    1   24    2   54   26    0
 4412   22    4   54   26    0   18    8   54   26    0
 7401   12   14   54   26    0    0   25   53   26    0
 7402   24    2   54   26    0   25    1   54   26    0
10891   16   10   54   26    0    3   23   54   26    0
10891                             5   10    6   15    0
I tried groupby but that throws errors when you write to csv.
This solution uses Divakar's justify function. If needed, convert to numeric in advance:
df = df.apply(pd.to_numeric, errors='coerce', axis=1)
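Since the answer depends on it, here is the justify helper as commonly shared by Divakar on Stack Overflow (reproduced from memory as a sketch; compare with the linked original if results differ):
import numpy as np

def justify(a, invalid_val=0, axis=1, side='left'):
    # Mask of valid (non-invalid) entries
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    # Sorting the boolean mask pushes the True values to one end of each row/column
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    # Fill an output array with the invalid value, then place the valid ones
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out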
Now, call groupby + transform:
df.set_index('Ids')\
.groupby(level=0)\
.transform(
justify, invalid_val=np.nan, axis=0, side='up'
)\
.dropna(how='all')
A B C D E F G H I J
Ids
4411 24.0 2.0 55.0 26.0 1.0 24.0 2.0 54.0 26.0 0.0
4412 22.0 4.0 54.0 26.0 0.0 18.0 8.0 54.0 26.0 0.0
7401 12.0 14.0 54.0 26.0 0.0 0.0 25.0 53.0 26.0 0.0
7402 24.0 2.0 54.0 26.0 0.0 25.0 1.0 54.0 26.0 0.0
10891 16.0 10.0 54.0 26.0 0.0 3.0 23.0 54.0 26.0 0.0
10891 NaN NaN NaN NaN NaN 5.0 10.0 6.0 15.0 0.0
This will be slow, but it can achieve what you need:
(df.replace('', np.nan)
   .groupby('Ids')
   .apply(lambda g: g.apply(lambda col: sorted(col, key=pd.isnull), 0))
   .dropna(axis=0, thresh=2)
   .fillna(''))
Out[539]:
     Ids     A     B     C     D    E     F     G     H     I    J
0   7402  24.0   2.0  54.0  26.0  0.0  25.0   1.0  54.0  26.0  0.0
2  10891  16.0  10.0  54.0  26.0  0.0   3.0  23.0  54.0  26.0  0.0
3  10891                                5.0  10.0   6.0  15.0  0.0
Assuming all the blank values are nan, another option using groupby and dropna:
df.loc[:,'A':'E'] = df.groupby('Ids').apply(lambda x: x.loc[:,'A':'E'].ffill(limit=1))
df.dropna(subset=['F','G','H','I','J'])
I'm learning Pandas and trying to learn things in it. However, I got stuck on adding new columns, as the new columns have a larger index number. I would like to add more than 3 columns.
here is my code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
example=pd.read_excel("C:/Users/ömer sarı/AppData/Local/Programs/Python/Python35-32/data_analysis/pydata-book-master/example.xlsx",names=["a","b","c","d","e","f","g"])
dropNan=example.dropna()
#print(dropNan)
Fillup=example.fillna(-99)
#print(Fillup)
Countt=Fillup.get_dtype_counts()
#print(Countt)
date=pd.date_range("2014-01-01","2014-01-15",freq="D")
#print(date)
mm=date.month
yy=date.year
dd=date.day
df=pd.DataFrame(example)
print(df)
x=[i for i in yy]
print(x)
df["year"]=df[x]
here is the example dataset:
a b c d e f g
0 1 1.0 2.0 5 3 11.0 57.0
1 2 4.0 6.0 10 6 22.0 59.0
2 3 9.0 12.0 15 9 33.0 61.0
3 4 16.0 20.0 20 12 44.0 63.0
4 5 25.0 NaN 25 15 NaN 65.0
5 6 NaN 42.0 30 18 66.0 NaN
6 7 49.0 56.0 35 21 77.0 69.0
7 8 64.0 72.0 40 24 88.0 71.0
8 9 81.0 NaN 45 27 99.0 73.0
9 10 NaN 110.0 50 30 NaN 75.0
10 11 121.0 NaN 55 33 121.0 77.0
11 12 144.0 156.0 60 36 132.0 NaN
12 13 169.0 182.0 65 39 143.0 81.0
13 14 196.0 NaN 70 42 154.0 83.0
14 15 225.0 240.0 75 45 165.0 85.0
here is the error message:
IndexError: indices are out-of-bounds
after that, I tried this one and got a new error:
df=pd.DataFrame(range(len(x)),index=x, columns=["a","b","c","d","e","f","g"])
pandas.core.common.PandasError: DataFrame constructor not properly called!
It is just a trial to learn: how can I add the date, with its split parts, as new columns like ["date", "year", "month", "day"]?
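A minimal sketch of one way to do the date part, assuming the goal is simply to attach the 15-day range and its components as columns (the frame here is a stand-in for your example data):
import pandas as pd

df = pd.DataFrame({"a": range(15)})  # stand-in with the same length as the date range
date = pd.date_range("2014-01-01", "2014-01-15", freq="D")

# Assign the dates and their split parts as new columns
df["date"] = date
df["year"] = date.year
df["month"] = date.month
df["day"] = date.day
print(df.head())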