How to set column name from column value in Pandas Python? - python

I have a data frame that looks like the following.
       0       1            2        3            4        5
0   0: 2   57: 9         None  436: 77  11469: 1018  203: 44
1   0: 0  57: 15         None  436: 47         None  203: 89
2  0: 45   57: 0  11469: 1116   436: 7         None   203: 0
3   0: 1  57: 23         None   436: 0    11469: 18     None
4  0: 23   57: 5         None  436: 63         None   203: 4
Here, the column values represent distance and time, in meters and seconds (57: 9 means 57 meters and 9 seconds). I want to rename the columns so that the meter value becomes the column name and the seconds value remains as the column value. Moreover, any None values should be replaced by zero (0).
Desired output:
 0  57  11469  436  11469  203
 2   9      0   77   1018   44
 0  15      0   47      0   89
45   0   1116    7      0    0
 1  23      0    0     18    0
23   5      0   63      0    4
I am new to python so I don't know how I can achieve that.

First split each value on ': ' and keep the last part (the seconds), replacing missing values with 0. For the column names, forward fill the missing values, select the last row, split it the same way, and keep the first part (the meters):
df1 = df.apply(lambda x: x.str.split(': ').str[-1]).fillna(0)
df1.columns = df.ffill().iloc[-1].str.split(': ').str[0].tolist()
print (df1)
    0  57  11469  436  11469  203
0   2   9      0   77   1018   44
1   0  15      0   47      0   89
2  45   0   1116    7      0    0
3   1  23      0    0     18    0
4  23   5      0   63      0    4
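For reference, here is a self-contained sketch of the same idea that also converts the results to integers. It is a variant, not the answer's exact code: it takes the column names from the first non-missing value in each column (bfill plus the first row, instead of ffill plus the last row), and the names `names` and `secs` are made up for illustration. It assumes every column has at least one non-missing value.

```python
import pandas as pd

# Rebuild the frame from the question (None marks a missing reading).
df = pd.DataFrame([
    ["0: 2", "57: 9", None, "436: 77", "11469: 1018", "203: 44"],
    ["0: 0", "57: 15", None, "436: 47", None, "203: 89"],
    ["0: 45", "57: 0", "11469: 1116", "436: 7", None, "203: 0"],
    ["0: 1", "57: 23", None, "436: 0", "11469: 18", None],
    ["0: 23", "57: 5", None, "436: 63", None, "203: 4"],
])

# Meters: back fill so the first row holds a value for every column,
# then keep the part before ': '.
names = df.bfill().iloc[0].str.split(": ").str[0].astype(int).tolist()

# Seconds: keep the part after ': ', convert to numbers, missing -> 0.
secs = df.apply(lambda col: pd.to_numeric(col.str.split(": ").str[-1]))
secs = secs.fillna(0).astype(int)
secs.columns = names

print(secs)
```

Converting with pd.to_numeric before fillna keeps every column numeric, so the zeros that replace None are real integers rather than a mix of strings and numbers.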

Related

Python: How to replicate rows in Dataframe with column value but changing the column value to its range

Have got a dataframe df
Store Aisle Table
11 59 2
11 61 3
I need to replicate each row as many times as its 'Table' value, with 'Table' counting from 1 up to that value, as below:
Store Aisle Table
11 59 1
11 59 2
11 61 1
11 61 2
11 61 3
I tried the code below, but it only replicates each row n times without changing the 'Table' value.
df.loc[df.index.repeat(df['Table'])]
Thanks!
You can do a groupby().cumcount() after that:
out = df.loc[df.index.repeat(df['Table'])]
out['Table'] = out.groupby(level=0).cumcount() + 1
Output:
Store Aisle Table
0 11 59 1
0 11 59 2
1 11 61 1
1 11 61 2
1 11 61 3
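We can also try explode. Note that map(range) alone would count from 0, so build the range from 1 to the 'Table' value inclusive:
out = df.assign(Table=df['Table'].map(lambda n: range(1, n + 1))).explode('Table')
Output:
Store Aisle Table
0 11 59 1
0 11 59 2
1 11 61 1
1 11 61 2
1 11 61 3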

How to read .data format data with Python?

I have data from https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/ . As you can see, it is in .data format. How can I read it as a pandas dataframe in Python?
I tried this, but it doesn't work:
with open("arrhythmia.data", "r") as f:
    arryth_df = pd.DataFrame(f.read())
It says ValueError: DataFrame constructor not properly called!
You can pass the URL of the file to read_csv, because here the .data file is in CSV format. It has no header row, so add header=None:
# show all columns when printing
pd.options.display.max_columns = None
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.data'
df = pd.read_csv(url, header=None)
print (df.head())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 \
0 75 0 190 80 91 193 371 174 121 -16 13 64 -2 ? 63 0
1 56 1 165 64 81 174 401 149 39 25 37 -17 31 ? 53 0
2 54 0 172 95 138 163 386 185 102 96 34 70 66 23 75 0
3 55 0 175 94 100 202 380 179 143 28 11 -5 20 ? 71 0
4 75 0 190 80 88 181 360 177 103 -16 13 61 3 ? ? 0
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 \
0 52 44 0 0 32 0 0 0 0 0 0 0 44 20 36
1 48 0 0 0 24 0 0 0 0 0 0 0 64 0 0
2 40 80 0 0 24 0 0 0 0 0 0 20 56 52 0
3 72 20 0 0 48 0 0 0 0 0 0 0 64 36 0
4 48 40 0 0 28 0 0 0 0 0 0 0 40 24 0
...
...
...
If you also want to convert ? to missing values (NaN), add the na_values='?' parameter:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.data'
df = pd.read_csv(url, header=None, na_values='?')
print (df.head())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 \
0 75 0 190 80 91 193 371 174 121 -16 13.0 64.0 -2.0 NaN
1 56 1 165 64 81 174 401 149 39 25 37.0 -17.0 31.0 NaN
2 54 0 172 95 138 163 386 185 102 96 34.0 70.0 66.0 23.0
3 55 0 175 94 100 202 380 179 143 28 11.0 -5.0 20.0 NaN
4 75 0 190 80 88 181 360 177 103 -16 13.0 61.0 3.0 NaN
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 \
0 63.0 0 52 44 0 0 32 0 0 0 0 0 0 0 44
1 53.0 0 48 0 0 0 24 0 0 0 0 0 0 0 64
2 75.0 0 40 80 0 0 24 0 0 0 0 0 0 20 56
3 71.0 0 72 20 0 0 48 0 0 0 0 0 0 0 64
4 NaN 0 48 40 0 0 28 0 0 0 0 0 0 0 40
...
...
Do it this way with StringIO (header=None is needed because the file has no header row):
from io import StringIO
import pandas as pd
with open("arrhythmia.data", "r") as f:
    data = StringIO(f.read())
arryth_df = pd.read_csv(data, header=None)
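As a small illustration of why read_csv works where pd.DataFrame(f.read()) does not: the DataFrame constructor expects structured data, not one long string, while read_csv parses the text. The sketch below uses two made-up rows in the same shape as the file (comma-separated, no header, '?' for missing), so it runs without downloading anything. For a local copy, passing the path straight to read_csv should work the same way.

```python
from io import StringIO

import pandas as pd

# Made-up sample in the same layout as arrhythmia.data:
# comma-separated values, no header row, '?' marking missing entries.
sample = StringIO("75,0,190,?,63\n56,1,165,23,?\n")

# header=None keeps the first row as data; na_values turns '?' into NaN.
df = pd.read_csv(sample, header=None, na_values="?")

print(df)
```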

How to fill in values of a dataframe column if the difference between values in another column is sufficiently small?

I have a dataframe df1:
Time Delta_time
0 0 NaN
1 15 15
2 18 3
3 30 12
4 45 15
5 64 19
6 80 16
7 82 2
8 100 18
9 120 20
where Delta_time is the difference between adjacent values in the Time column. I have another dataframe df2 that has time values numbering from 0 to 120 (121 rows) and another column called 'Short_gap'.
How do I set the value of Short_gap to 1 for all Time values that lie in a Delta_time value smaller than 5? For example, the Short_gap column should have a value of 1 for Time = 15,16,17,18 since Delta_time = 3 < 5.
Edit: Currently, df2 looks like this.
Time Short_gap
0 0 0
1 1 0
2 2 0
3 3 0
... ... ...
118 118 0
119 119 0
120 120 0
The expected output for df2 is
Time Short_gap
0 0 0
1 1 0
2 2 0
... ... ...
13 13 0
14 14 0
15 15 1
16 16 1
17 17 1
18 18 1
19 19 0
20 20 0
... ... ...
78 78 0
79 79 0
80 80 1
81 81 1
82 82 1
83 83 0
84 84 0
... ... ...
119 119 0
120 120 0
Try:
t = df1['Delta_time'].shift(-1)
# repeat() needs integer counts, so cast after filling the trailing NaN
df2 = ((t < 5).repeat(t.fillna(1).astype(int)).astype(int).reset_index(drop=True)
       .to_frame(name='Short_gap').rename_axis('Time').reset_index())
print(df2.head(20))
print('...')
print(df2.loc[78:84])
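Output (note that the right endpoint of each short gap, Time = 18 and Time = 82, stays 0 here, unlike in the expected output):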
Time Short_gap
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
8 8 0
9 9 0
10 10 0
11 11 0
12 12 0
13 13 0
14 14 0
15 15 1
16 16 1
17 17 1
18 18 0
19 19 0
...
Time Short_gap
78 78 0
79 79 0
80 80 1
81 81 1
82 82 0
83 83 0
84 84 0
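If the right endpoint of each short gap should be flagged too, as in the expected output from the question, one possible sketch is to mark the closed interval from Time - Delta_time to Time for every short gap. This is an alternative to the repeat-based answer, not a correction of it. It rebuilds df1 and df2 from the values shown in the question, and the names `short`, `start` and `end` are only illustrative.

```python
import numpy as np
import pandas as pd

# df1 as in the question; Delta_time is the difference between adjacent times.
df1 = pd.DataFrame({"Time": [0, 15, 18, 30, 45, 64, 80, 82, 100, 120]})
df1["Delta_time"] = df1["Time"].diff()

# df2 has one row per time value from 0 to 120, with Short_gap starting at 0.
df2 = pd.DataFrame({"Time": np.arange(121), "Short_gap": 0})

# Rows of df1 that close a gap shorter than 5.
short = df1[df1["Delta_time"] < 5]

for end, delta in zip(short["Time"], short["Delta_time"]):
    start = end - delta
    # between() is inclusive on both ends by default.
    df2.loc[df2["Time"].between(start, end), "Short_gap"] = 1

print(df2.loc[13:20])
print(df2.loc[78:84])
```

A loop is fine here because only the short gaps are visited. With many gaps, a vectorized approach such as the repeat-based answer may be preferable.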

get only previous three values from the dataframe

I am new to Python and pandas. I have a dataframe like this:
Id Offset feature
0 0 2
0 5 2
0 11 0
0 21 22
0 28 22
1 32 0
1 38 21
1 42 21
1 52 21
1 55 0
1 58 0
1 62 1
1 66 1
1 70 1
2 73 0
2 78 1
2 79 1
From this, for each row where the feature is 0, I am trying to get the previous three values from the column, along with their offsets.
So the output would be like:
offset Feature
11 2
21 22
28 22
// These three values belong to the 0 at offset 32.
In the same dataframe, for the next places where the feature is 0:
38 21
42 21
52 21
58 0
62 1
66 1
Is there any way I can get this?
Thanks
This will be on the basis of the document ID.
Even though I am quite new to pandas, I have attempted to answer your question.
I populated your data as comma-separated values in data.csv and then used slicing to get the previous 3 rows.
import pandas as pd
df = pd.read_csv('./data.csv')
for index in (df.loc[df['Feature'] == 0]).index:
    print(df.loc[index-3:index-1])
The output looks like this. The leftmost column is the index, which you can discard if you don't want it. Is this what you were looking for?
Offset Feature
2 11 2
3 21 22
4 28 22
Offset Feature
6 38 21
7 42 21
8 52 21
Offset Feature
7 42 21
8 52 21
9 55 0
Offset Feature
11 62 1
12 66 1
13 70 1
Note : There might be a more pythonic way to do this.
You can take the 3 rows before each 0 value in the column using loc.
Follow the code:
import pandas as pd
df = pd.read_csv("<path_of_the_file>")
zero_indexes = list(df[df['Feature'] == 0].index)
for each_zero_index in zero_indexes:
    df1 = df.loc[each_zero_index - 3: each_zero_index]
    print(df1) # This dataframe has 4 records. Your previous three including the zero record.
Output:
Offset Feature
2 11 2
3 21 22
4 28 22
5 32 0
Offset Feature
6 38 21
7 42 21
8 52 21
9 55 0
Offset Feature
7 42 21
8 52 21
9 55 0
10 58 0
Offset Feature
11 62 1
12 66 1
13 70 1
14 73 0

Delete lines that contain decimal numbers

I am trying to delete lines that contain decimal numbers. For instance:
82.45 76.16 21.49 -2.775
5 24 13 6 9 0 3 2 4 9 7 11 54 11 1 1 18 5 0 0
1 1 0 2 2 0 0 0 0 0 0 0 14 90 21 5 24 26 73 13
20 33 23 59 158 85 17 6 158 66 15 13 13 10 2 37 81 0 0 0
1 3 0 19 8 158 75 7 10 8 5 1 23 58 148 77 120 78 6 7
158 80 15 10 16 21 6 37 100 25 0 0 0 0 0 3 1 10 9 1
0 0 0 0 11 16 57 15 0 0 0 0 158 76 9 1 0 0 0 0
22 17 0 0 0 0 0 0
50.04 143.84 18.52 -1.792
3 0 0 0 0 0 0 0 36 0 0 0 2 4 0 1 23 2 0 0
8 24 4 12 21 9 5 2 0 0 0 4 40 0 0 0 0 0 0 12
150 11 2 7 12 16 4 59 72 8 30 88 68 83 15 27 21 11 49 94
6 1 1 8 17 8 0 0 0 0 0 5 150 150 33 46 9 0 0 20
28 49 81 150 76 5 8 17 36 23 41 48 7 1 16 88 0 3 0 0
0 0 0 0 36 108 13 9 2 0 3 61 19 26 14 34 27 8 98 150
14 2 0 1 1 0 115 150
114.27 171.37 10.74 -2.245
.................. and this pattern continues for thousands of lines and likewise I have about 3000 files with similar pattern of data.
So, I want to delete lines that have these decimal numbers. In most cases, every 8th line has decimal numbers and hence I tried using awk 'NR % 8! == 0' < file_name. But the problem is, not all files in the database have their every 8th line as decimal numbers. So, is there a way in which I can delete the lines that have decimal numbers? I am coding in python 2.7 in ubuntu.
You can just look for lines containing a decimal point:
with open('filename_without_decimals.txt', 'w') as of:
    with open('filename.txt') as fp:
        for line in fp:
            # str.index raises ValueError when '.' is absent, so test membership instead
            if '.' not in line:
                of.write(line)
If you prefer to use sed, it is cleaner:
sed -i '/\./d' file.txt
The solution would be something like
file = open('textfile.txt')
text = ""
for line in file.readlines():
    if '.' not in line:
        text += line
print text
Have you tried this, using awk:
awk '!/\./{print}' your_file
Or in Python:
with_deci = open('with.txt')
no_deci = open('without.txt', 'w')
for line in with_deci.readlines():
    if '.' not in line:
        no_deci.write(line)
with_deci.close()
no_deci.close()
readlines returns a list of all the lines in the file.
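The answers above all test for a bare '.' character, which is enough for the sample data. A slightly stricter sketch, in case a line could contain a stray dot that is not part of a number, is to match a digit on both sides of the dot. It is written for Python 3, not the 2.7 mentioned in the question, and the names `DECIMAL` and `drop_decimal_lines` are made up for illustration.

```python
import re

# A decimal number here means a dot with a digit on each side, e.g. 82.45.
DECIMAL = re.compile(r"\d\.\d")


def drop_decimal_lines(lines):
    """Return only the lines that contain no decimal number."""
    return [line for line in lines if not DECIMAL.search(line)]


# A few lines taken from the sample in the question.
sample = [
    "82.45 76.16 21.49 -2.775\n",
    "5 24 13 6 9 0 3 2 4 9 7 11 54 11 1 1 18 5 0 0\n",
    "22 17 0 0 0 0 0 0\n",
    "50.04 143.84 18.52 -1.792\n",
]

kept = drop_decimal_lines(sample)
print("".join(kept), end="")
```

For the roughly 3000 files mentioned in the question, the same function could be applied to the lines of each file in a loop over the file names.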
