How to set column name from column value in Pandas Python?

I have a data frame that looks like the following.
0      1       2            3        4            5
0: 2   57: 9   None         436: 77  11469: 1018  203: 44
0: 0   57: 15  None         436: 47  None         203: 89
0: 45  57: 0   11469: 1116  436: 7   None         203: 0
0: 1   57: 23  None         436: 0   11469: 18    None
0: 23  57: 5   None         436: 63  None         203: 4
Here, each value holds a distance and a time, in meters and seconds (57: 9 means 57 meters and 9 seconds). I want to rename the columns so that the meter value becomes the column name and the seconds value remains as the column value. In addition, None values should be replaced by zero (0).
Desired output:
 0  57  11469  436  11469  203
 2   9      0   77   1018   44
 0  15      0   47      0   89
45   0   1116    7      0    0
 1  23      0    0     18    0
23   5      0   63      0    4
I am new to Python, so I don't know how to achieve this.

First, split each value on ': ' and keep the last part (the seconds), replacing missing values with 0. For the column names, forward-fill the missing values, take the last row, split again, and keep the first part (the meters):
df1 = df.apply(lambda x: x.str.split(': ').str[-1]).fillna(0)
df1.columns = df.ffill().iloc[-1].str.split(': ').str[0].tolist()
print (df1)
    0  57  11469  436  11469  203
0   2   9      0   77   1018   44
1   0  15      0   47      0   89
2  45   0   1116    7      0    0
3   1  23      0    0     18    0
4  23   5      0   63      0    4
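
For a self-contained run, the same idea can be sketched with the sample data rebuilt by hand. How `df` is constructed here (strings plus `None`) is an assumption about how the original frame is stored, and the names `seconds` and `meters` are only illustrative. Unlike the snippet above, this version fills missing cells with the string "0" and then converts everything to integers.

```python
import pandas as pd

# Sample data rebuilt from the question; assumed to be strings and None.
df = pd.DataFrame([
    ["0: 2",  "57: 9",  None,          "436: 77", "11469: 1018", "203: 44"],
    ["0: 0",  "57: 15", None,          "436: 47", None,          "203: 89"],
    ["0: 45", "57: 0",  "11469: 1116", "436: 7",  None,          "203: 0"],
    ["0: 1",  "57: 23", None,          "436: 0",  "11469: 18",   None],
    ["0: 23", "57: 5",  None,          "436: 63", None,          "203: 4"],
])

# Seconds: the part after ": "; missing cells become 0.
seconds = df.apply(lambda col: col.str.split(": ").str[-1]).fillna("0").astype(int)

# Meters: the part before ": " of the last non-missing value in each column.
meters = df.ffill().iloc[-1].str.split(": ").str[0].astype(int)

seconds.columns = meters.tolist()
print(seconds)
```

Note that the label 11469 appears twice, so selecting `seconds[11469]` returns both columns rather than one.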

Related

Python: How to replicate rows in Dataframe with column value but changing the column value to its range

I have a DataFrame df:
Store Aisle Table
11 59 2
11 61 3
I need to replicate each row as many times as its 'Table' value, with the 'Table' column counting up from 1, as below:
Store Aisle Table
11 59 1
11 59 2
11 61 1
11 61 2
11 61 3
I tried the code below, but it does not change the value; it just replicates the same row n times.
df.loc[df.index.repeat(df['Table'])]
Thanks!

You can do a groupby().cumcount() after that:
out = df.loc[df.index.repeat(df['Table'])]
out['Table'] = out.groupby(level=0).cumcount() + 1
Output:
Store Aisle Table
0 11 59 1
0 11 59 2
1 11 61 1
1 11 61 2
1 11 61 3

We can also try explode, mapping each count to the range 1..n so that the numbering starts at 1:
out = df.assign(Table=df['Table'].map(lambda n: range(1, n + 1))).explode('Table')
Output:
   Store  Aisle Table
0     11     59     1
0     11     59     2
1     11     61     1
1     11     61     2
1     11     61     3
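
The repeat-and-count approach can be checked end to end with the two sample rows from the question. This is a minimal sketch; the final `reset_index(drop=True)` is an optional extra, added here only to get a clean 0..n-1 index.

```python
import pandas as pd

df = pd.DataFrame({"Store": [11, 11], "Aisle": [59, 61], "Table": [2, 3]})

# Repeat each row `Table` times; the repeated rows keep their original index label.
out = df.loc[df.index.repeat(df["Table"])].copy()

# Number the copies 1..n within each original row (grouped by index label).
out["Table"] = out.groupby(level=0).cumcount() + 1

# Optional: replace the repeated labels with a fresh 0..n-1 index.
out = out.reset_index(drop=True)
print(out)
```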

How to read .data format data with Python?

I have data from https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/ . As you can see, it is in .data format. How do I read it as a pandas DataFrame in Python?
I tried this, but it doesn't work:
with open("arrhythmia.data", "r") as f:
    arryth_df = pd.DataFrame(f.read())
It says ValueError: DataFrame constructor not properly called!

You can pass the URL of the file to read_csv, because this .data file is in CSV format. It has no header row, so add header=None:
# optional: display all columns when printing
pd.options.display.max_columns = None
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.data'
df = pd.read_csv(url, header=None)
print (df.head())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 \
0 75 0 190 80 91 193 371 174 121 -16 13 64 -2 ? 63 0
1 56 1 165 64 81 174 401 149 39 25 37 -17 31 ? 53 0
2 54 0 172 95 138 163 386 185 102 96 34 70 66 23 75 0
3 55 0 175 94 100 202 380 179 143 28 11 -5 20 ? 71 0
4 75 0 190 80 88 181 360 177 103 -16 13 61 3 ? ? 0
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 \
0 52 44 0 0 32 0 0 0 0 0 0 0 44 20 36
1 48 0 0 0 24 0 0 0 0 0 0 0 64 0 0
2 40 80 0 0 24 0 0 0 0 0 0 20 56 52 0
3 72 20 0 0 48 0 0 0 0 0 0 0 64 36 0
4 48 40 0 0 28 0 0 0 0 0 0 0 40 24 0
...
...
...
If you also want to convert ? to missing values (NaN), add the na_values='?' parameter:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.data'
df = pd.read_csv(url, header=None, na_values='?')
print (df.head())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 \
0 75 0 190 80 91 193 371 174 121 -16 13.0 64.0 -2.0 NaN
1 56 1 165 64 81 174 401 149 39 25 37.0 -17.0 31.0 NaN
2 54 0 172 95 138 163 386 185 102 96 34.0 70.0 66.0 23.0
3 55 0 175 94 100 202 380 179 143 28 11.0 -5.0 20.0 NaN
4 75 0 190 80 88 181 360 177 103 -16 13.0 61.0 3.0 NaN
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 \
0 63.0 0 52 44 0 0 32 0 0 0 0 0 0 0 44
1 53.0 0 48 0 0 0 24 0 0 0 0 0 0 0 64
2 75.0 0 40 80 0 0 24 0 0 0 0 0 0 20 56
3 71.0 0 72 20 0 0 48 0 0 0 0 0 0 0 64
4 NaN 0 48 40 0 0 28 0 0 0 0 0 0 0 40
...
...

You can also do it with StringIO; pass header=None because the file has no header row:
from io import StringIO
import pandas as pd
with open("arrhythmia.data", "r") as f:
    data = StringIO(f.read())
arryth_df = pd.read_csv(data, header=None)
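
To try the options without downloading anything, here is a small sketch that feeds read_csv a made-up two-line sample in the same shape (comma-separated, no header, ? for missing). The sample text is invented for illustration and is not taken from the arrhythmia file.

```python
from io import StringIO

import pandas as pd

# Two made-up rows in the same shape as a headerless, comma-separated .data file.
sample = "75,0,190,?,63\n56,1,165,23,?\n"

# header=None keeps the first line as data; na_values="?" turns ? into NaN.
df = pd.read_csv(StringIO(sample), header=None, na_values="?")
print(df)
print(df.isna().sum().sum())  # number of missing cells
```

For a file on disk, passing the path directly to `pd.read_csv` with the same two parameters should behave the same way.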

How to fill in values of a dataframe column if the difference between values in another column is sufficiently small?

I have a dataframe df1:
Time Delta_time
0 0 NaN
1 15 15
2 18 3
3 30 12
4 45 15
5 64 19
6 80 16
7 82 2
8 100 18
9 120 20
where Delta_time is the difference between adjacent values in the Time column. I have another dataframe df2 that has time values numbering from 0 to 120 (121 rows) and another column called 'Short_gap'.
How do I set the value of Short_gap to 1 for all Time values that lie in a Delta_time value smaller than 5? For example, the Short_gap column should have a value of 1 for Time = 15,16,17,18 since Delta_time = 3 < 5.
Edit: Currently, df2 looks like this.
Time Short_gap
0 0 0
1 1 0
2 2 0
3 3 0
... ... ...
118 118 0
119 119 0
120 120 0
The expected output for df2 is
Time Short_gap
0 0 0
1 1 0
2 2 0
... ... ...
13 13 0
14 14 0
15 15 1
16 16 1
17 17 1
18 18 1
19 19 0
20 20 0
... ... ...
78 78 0
79 79 0
80 80 1
81 81 1
82 82 1
83 83 0
84 84 0
... ... ...
119 119 0
120 120 0

Try:
t = df1['Delta_time'].shift(-1)
df2 = ((t < 5).repeat(t.fillna(1).astype(int)).astype(int).reset_index(drop=True)
       .to_frame(name='Short_gap').rename_axis('Time').reset_index())
print(df2.head(20))
print('...')
print(df2.loc[78:84])
Output:
Time Short_gap
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
8 8 0
9 9 0
10 10 0
11 11 0
12 12 0
13 13 0
14 14 0
15 15 1
16 16 1
17 17 1
18 18 0
19 19 0
...
Time Short_gap
78 78 0
79 79 0
80 80 1
81 81 1
82 82 0
83 83 0
84 84 0
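
The printed output above leaves Time = 18 and Time = 82 at 0, while the question's expected output marks them as 1. If both ends of each short gap should be flagged, one alternative is to mark the closed interval explicitly. This is a sketch under that assumption, with df1 and df2 rebuilt from the question's description; it is a different technique from the repeat-based answer.

```python
import numpy as np
import pandas as pd

# df1 rebuilt from the question; Delta_time is the difference between adjacent Time values.
df1 = pd.DataFrame({"Time": [0, 15, 18, 30, 45, 64, 80, 82, 100, 120]})
df1["Delta_time"] = df1["Time"].diff()

# df2 as described: Time from 0 to 120, Short_gap initially 0.
df2 = pd.DataFrame({"Time": np.arange(121), "Short_gap": 0})

# Rows of df1 whose gap to the previous row is below the threshold.
short = df1[df1["Delta_time"] < 5]

# Mark every Time from the start of the gap to its end, both inclusive.
for end, delta in zip(short["Time"], short["Delta_time"]):
    start = end - delta
    df2.loc[df2["Time"].between(start, end), "Short_gap"] = 1

print(df2.loc[df2["Short_gap"] == 1, "Time"].tolist())
# [15, 16, 17, 18, 80, 81, 82]
```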

get only previous three values from the dataframe

I am new to Python and pandas. I have a DataFrame that looks like this:
Id Offset feature
0 0 2
0 5 2
0 11 0
0 21 22
0 28 22
1 32 0
1 38 21
1 42 21
1 52 21
1 55 0
1 58 0
1 62 1
1 66 1
1 70 1
2 73 0
2 78 1
2 79 1
From this, for each row where the feature is 0, I am trying to get the previous three values of the column together with their offsets.
So the output would look like:
offset Feature
11 2
21 22
28 22
These three values belong to the 0 at offset 32.
Likewise, for the next places in the same DataFrame where the feature is 0:
38 21
42 21
52 21
58 0
62 1
66 1
Is there any way I can get this? It should be done on the basis of the document ID.
Thanks

I am quite new to pandas too, but I have attempted to answer your question.
I saved your data as comma-separated values in data.csv and then used slicing to get the previous 3 rows.
import pandas as pd
df = pd.read_csv('./data.csv')
for index in (df.loc[df['Feature'] == 0]).index:
    print(df.loc[index-3:index-1])
The output looks like this. The leftmost column is the index, which you can discard if you don't want it. Is this what you were looking for?
Offset Feature
2 11 2
3 21 22
4 28 22
Offset Feature
6 38 21
7 42 21
8 52 21
Offset Feature
7 42 21
8 52 21
9 55 0
Offset Feature
11 62 1
12 66 1
13 70 1
Note: there may be a more Pythonic way to do this.

You can take the three rows before each 0 in the column using loc.
The code:
import pandas as pd
df = pd.read_csv("<path_of_the_file>")
zero_indexes = list(df[df['Feature'] == 0].index)
for each_zero_index in zero_indexes:
    df1 = df.loc[each_zero_index - 3: each_zero_index]
    print(df1)  # This DataFrame has 4 records: the previous three plus the zero record.
Output:
Offset Feature
2 11 2
3 21 22
4 28 22
5 32 0
Offset Feature
6 38 21
7 42 21
8 52 21
9 55 0
Offset Feature
7 42 21
8 52 21
9 55 0
10 58 0
Offset Feature
11 62 1
12 66 1
13 70 1
14 73 0
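
Both loops above subtract 3 from the index label, which assumes a default 0..n-1 index and at least three rows before every zero. A positional variant that clamps at the top of the frame might look like the sketch below. The helper name `previous_rows` and the shortened sample data are made up for illustration.

```python
import pandas as pd

# A shortened, made-up frame in the same shape as the question's data.
df = pd.DataFrame({
    "Offset":  [0, 5, 11, 21, 28, 32, 38, 42, 52, 55],
    "Feature": [0, 2, 2, 22, 22, 0, 21, 21, 21, 0],
})

def previous_rows(frame, n=3, column="Feature"):
    """Map the offset of each zero row to the (up to) n rows just before it."""
    result = {}
    zero_positions = [i for i, value in enumerate(frame[column]) if value == 0]
    for pos in zero_positions:
        start = max(pos - n, 0)  # clamp so a zero near the top cannot go negative
        result[int(frame["Offset"].iloc[pos])] = frame.iloc[start:pos]
    return result

for offset, rows in previous_rows(df).items():
    print(f"zero at offset {offset}:")
    print(rows)
```

A zero in the very first row simply yields an empty frame instead of an error or a wrapped-around slice.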

Delete lines that contain decimal numbers

I am trying to delete lines that contain decimal numbers. For instance:
82.45 76.16 21.49 -2.775
5 24 13 6 9 0 3 2 4 9 7 11 54 11 1 1 18 5 0 0
1 1 0 2 2 0 0 0 0 0 0 0 14 90 21 5 24 26 73 13
20 33 23 59 158 85 17 6 158 66 15 13 13 10 2 37 81 0 0 0
1 3 0 19 8 158 75 7 10 8 5 1 23 58 148 77 120 78 6 7
158 80 15 10 16 21 6 37 100 25 0 0 0 0 0 3 1 10 9 1
0 0 0 0 11 16 57 15 0 0 0 0 158 76 9 1 0 0 0 0
22 17 0 0 0 0 0 0
50.04 143.84 18.52 -1.792
3 0 0 0 0 0 0 0 36 0 0 0 2 4 0 1 23 2 0 0
8 24 4 12 21 9 5 2 0 0 0 4 40 0 0 0 0 0 0 12
150 11 2 7 12 16 4 59 72 8 30 88 68 83 15 27 21 11 49 94
6 1 1 8 17 8 0 0 0 0 0 5 150 150 33 46 9 0 0 20
28 49 81 150 76 5 8 17 36 23 41 48 7 1 16 88 0 3 0 0
0 0 0 0 36 108 13 9 2 0 3 61 19 26 14 34 27 8 98 150
14 2 0 1 1 0 115 150
114.27 171.37 10.74 -2.245
.................. and this pattern continues for thousands of lines and likewise I have about 3000 files with similar pattern of data.
So, I want to delete the lines that contain decimal numbers. In most cases every 8th line has decimal numbers, so I tried awk 'NR % 8! == 0' < file_name. The problem is that not every file in the database has decimal numbers on every 8th line. Is there a way to delete the lines that contain decimal numbers? I am coding in Python 2.7 on Ubuntu.

You can just look for lines that contain a decimal point:
with open('filename_without_decimals.txt', 'w') as of:
    with open('filename.txt') as fp:
        for line in fp:
            # str.index raises ValueError when "." is absent, so test membership instead
            if "." not in line:
                of.write(line)
If you prefer sed, it is cleaner:
sed -i '/\./d' file.txt

The solution would be something like:
text = ""
with open('textfile.txt') as f:
    for line in f.readlines():
        if '.' not in line:
            text += line
print(text)

Have you tried this, using awk?
awk '!/\./{print}' your_file

deci = open('with.txt')
no_deci = open('without.txt', 'w')
for line in deci.readlines():
    if '.' not in line:
        no_deci.write(line)
deci.close()
no_deci.close()
readlines returns a list of all the lines in the file.
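
The answers above drop any line containing a dot. If the files could ever contain dots that are not part of a number, a slightly stricter test is to look for a digit, a dot, and a digit. The sketch below is written for Python 3 (the question mentions 2.7), and the helper name `drop_decimal_lines` is made up for illustration.

```python
import re

DECIMAL = re.compile(r"\d\.\d")  # a digit, a dot, a digit: e.g. 82.45

def drop_decimal_lines(lines):
    """Yield only the lines that contain no decimal number."""
    for line in lines:
        if not DECIMAL.search(line):
            yield line

# A few lines taken from the question's sample.
sample = [
    "82.45 76.16 21.49 -2.775\n",
    "5 24 13 6 9 0 3 2\n",
    "22 17 0 0 0 0 0 0\n",
    "50.04 143.84 18.52 -1.792\n",
]
kept = list(drop_decimal_lines(sample))
print("".join(kept), end="")
```

Because the helper accepts any iterable of lines, an open file object can be passed in place of `sample`, with the kept lines written to a second file.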
