How to read txt in certain condition - python

I have a text file like the one below.
A1 1234 56
B2 1234 56
C3 2345167
I also have a table of start positions and lengths.
Each row gives the 1-based position where a field starts in every line of the text file, and the length of that field.
start length
1 1
2 1
3 1
4 2
6 2
8 2
10 1
I would like to read the file into the columns below, according to the start positions and lengths.
A 1 nan 12 34 5 6
B 2 nan 12 34 5 6
C 3 nan 23 45 16 7
First, I tried
pd.read_csv("file.txt", sep=" ")
but I couldn't figure out how to split the columns.
How can I read the file and split it into a dataframe?

As mentioned in the comments, this isn't a CSV format, and so I had to produce a work-around.
def get_row_format(length_file):
    with open(length_file, 'r') as fd_len:
        # Read in the file, not a CSV!
        # This double list comprehension produces a list of lists
        rows = [[x.strip() for x in y.split()] for y in fd_len.readlines()]
    # Determine the row format from the rows; rows[1:] skips the header
    row_form = {int(x[0]): int(x[1]) for x in rows[1:]}
    return row_form

def read_with_row_format(data_file, rform):
    with open(data_file, 'r') as fd_data:
        for row in fd_data.readlines():
            # Slice each field: the start k is 1-based, v is the length
            # (dicts keep insertion order on Python 3.7+)
            formatted_output = [row[k-1:k+v-1] for k, v in rform.items()]
            print(formatted_output)
The first function gets the 'row-format' and the second function applies that row format to each line in the file
Usage:
rform = get_row_format('lengths.csv')
read_with_row_format('data.csv', rform)
Output:
['A', '1', ' ', '12', '34', ' 5', '6']
['B', '2', ' ', '12', '34', ' 5', '6']
['C', '3', ' ', '23', '45', '16', '7']

This is a fixed-width file, so you can use pandas.read_fwf:
import pandas as pd
from io import StringIO
s = StringIO("""A1 1234 56
B2 1234 56
C3 2345167""")
# widths is the data frame defined further down
pd.read_fwf(s, widths=widths["length"].tolist(), header=None)
# 0 1 2 3 4 5 6
#0 A 1 NaN 12 34 5 6
#1 B 2 NaN 12 34 5 6
#2 C 3 NaN 23 45 16 7
The widths data frame:
widths = pd.read_csv(StringIO("""start length
1 1
2 1
3 1
4 2
6 2
8 2
10 1"""), sep=r"\s+")
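The start column of the table can be used as well, not only the lengths. Below is a minimal sketch, assuming the start positions are 1-based as in the question: read_fwf also accepts colspecs, a list of half-open (from, to) character intervals, which can be derived from both columns. The names layout and colspecs are illustrative and not part of the answer above.

```python
from io import StringIO

import pandas as pd

# The start/length table from the question (start is assumed to be 1-based)
layout = pd.read_csv(StringIO("""start length
1 1
2 1
3 1
4 2
6 2
8 2
10 1"""), sep=r"\s+")

# Convert each (start, length) pair to a 0-based half-open interval
colspecs = [(int(s) - 1, int(s) - 1 + int(n))
            for s, n in zip(layout["start"], layout["length"])]

data = StringIO("""A1 1234 56
B2 1234 56
C3 2345167""")

df = pd.read_fwf(data, colspecs=colspecs, header=None)
print(df)
```

Unlike widths, colspecs does not require the fields to be contiguous, so the same approach should also work when the table skips some characters.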

Since you have the starting position and length of each field, use them.
Here is code to carry that out. Each line is taken in turn. Each field is the slice from the start column to the same position plus the length of the field.
I leave the conversions to you.
data = [
"A1 1234 56",
"B2 1234 56",
"C3 2345167"
]
table = [
[1, 1],
[2, 1],
[3, 1],
[4, 2],
[6, 2],
[8, 2],
[10, 1]
]
for line in data:
    # start is 1-based, so the slice runs from start-1 to start-1+length
    fields = [line[start-1 : start+length-1] for start, length in table]
    print(fields)
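The conversions left open above could look like the following sketch. It assumes that blank fields should become None and that digit-only fields should become integers; parse_line is a hypothetical helper name, not something from the answer.

```python
def parse_line(line, table):
    """Slice one fixed-width line and convert each field.

    table holds (start, length) pairs with 1-based start positions.
    Blank fields become None, digit-only fields become int,
    and anything else is kept as a stripped string.
    """
    fields = []
    for start, length in table:
        raw = line[start - 1 : start - 1 + length].strip()
        if not raw:
            fields.append(None)
        elif raw.isdigit():
            fields.append(int(raw))
        else:
            fields.append(raw)
    return fields


data = ["A1 1234 56", "B2 1234 56", "C3 2345167"]
table = [(1, 1), (2, 1), (3, 1), (4, 2), (6, 2), (8, 2), (10, 1)]

rows = [parse_line(line, table) for line in data]
for row in rows:
    print(row)
# ['A', 1, None, 12, 34, 5, 6]
# ['B', 2, None, 12, 34, 5, 6]
# ['C', 3, None, 23, 45, 16, 7]
```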

Related

Adding 2 rows with 0s at the start and end of pandas dataframe

I have a pandas DataFrame named df_x.
I want to add a row of 0s at the start and another at the end of the data frame (two rows in total).
#create DataFrame
df_x = pd.DataFrame({'logvalue': ['20', '20.5', '18.5', '2', '10'],
'ID': ['1', '2', '3', '4', '5']})
Output should look like below.
logvalue ID violatedInstances
0 0 0
20 1 0
20.5 2 1
18.5 3 0
2 4 1
10 5 1
0 0 0
The output should rearrange the indexes of the dataframe as well.
How can I do this in pandas?
You can use concat:
First create a new dataframe (df_y) that contains the zero'd row
Use the concat function to join this dataframe with the original
Use the reset_index(drop=True) function to reset the index.
Code:
import pandas as pd
df_x = pd.DataFrame({'logvalue': [20.0, 20.5, 18.5, 2.0, 10.0],
                     'ID': [1, 2, 3, 4, 5],
                     'violatedInstances': [0, 1, 0, 1, 1]})
# Extract the column names from the original dataframe
column_names = df_x.columns
number_of_columns = len(column_names)
row_of_zeros = [0]*number_of_columns
# Create a new dataframe that has a row of zeros
df_y = pd.DataFrame([row_of_zeros], columns=column_names)
# Join the dataframes together
output = pd.concat([df_y, df_x, df_y]).reset_index(drop=True)
print(output)
Output:
logvalue ID violatedInstances
0 0.0 0 0
1 20.0 1 0
2 20.5 2 1
3 18.5 3 0
4 2.0 4 1
5 10.0 5 1
6 0.0 0 0
Example
df_x = pd.DataFrame({'logvalue': ['20', '20.5', '18.5', '2', '10'],
'ID': ['1', '2', '3', '4', '5']})
df_x
logvalue ID
0 20 1
1 20.5 2
2 18.5 3
3 2 4
4 10 5
Code
Use reindex with fill_value:
idx = ['start'] + df_x.index.tolist() + ['end']
df_x.reindex(idx, fill_value=0).reset_index(drop=True)
result:
logvalue ID
0 0 0
1 20 1
2 20.5 2
3 18.5 3
4 2 4
5 10 5
6 0 0
'start' and 'end' in idx can be any labels that are not already in the index of df_x; reindex creates a new row filled with fill_value for each of them.

How to append a list of lists with different lengths

I am trying to use a list of lists to add rows to a dataframe.
The error is as follows:
IndexError: invalid index to scalar variable.
The code is below:
Total_List = [[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15]]
Some_List = ['0', '1', '2', '3', '4']
first_row = {'A': [0], 'B': [0], 'C': [0]}
All_Rows = pd.DataFrame(first_row)
#Optimized_Trades
for i in range(len(Some_List)):
    for j in range(len(Some_List[i])):
        df_temp = { 'A': Total_List[i][j], 'B': Total_List[i][j], 'C': Total_List[i][j]}
        All_Rows = All_Rows.append(df_temp, ignore_index = True)
All_Trades = All_Trades[1:]
display(All_Trades)
Ideally, the final output would be:
1,4,7,10,13
2,5,8,11,14
3,6,9,12,15
IIUC, you want to add each of the first, second ... nth elements of each sublist as rows of the data frame, which is equivalent to the dataframe of the transpose of the list of lists.
You don't need a for loop to do this in Python.
Using zip with the unpacking operator
There are a few ways to do it. One is zip with the unpacking operator *, as in list(zip(*l)):
l = [[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15]]
lt = pd.DataFrame(zip(*l)) #<---
print(lt)
0 1 2 3 4
0 1 4 7 10 13
1 2 5 8 11 14
2 3 6 9 12 15
Using pandas transpose
A simpler way is to let pandas do it with transpose:
l = [[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15]]
lt = pd.DataFrame(l).T #<---
print(lt)
0 1 2 3 4
0 1 4 7 10 13
1 2 5 8 11 14
2 3 6 9 12 15
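The title mentions sublists of different lengths, which the examples above do not cover: zip stops at the shortest sublist. A minimal sketch for that case, assuming missing cells should be padded with None, uses itertools.zip_longest from the standard library. The list ragged is made-up illustration data.

```python
from itertools import zip_longest

# Hypothetical sublists of unequal length
ragged = [[1, 2, 3], [4, 5], [7, 8, 9, 10]]

# Transpose, padding the shorter sublists with None
rows = [list(t) for t in zip_longest(*ragged, fillvalue=None)]
for row in rows:
    print(row)
# [1, 4, 7]
# [2, 5, 8]
# [3, None, 9]
# [None, None, 10]
```

Passing rows to pd.DataFrame should then show the padded cells as missing values.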

Python - How to sort except one index [duplicate]

This question already has an answer here:
Pandas Python: sort dataframe but don't include given row
(1 answer)
Closed 4 years ago.
columns=['NAME', 'AB', 'H']
import pandas as pd
df = pd.DataFrame([['Harper', '10', '5'], ['Trout', '10', '5'], ['Ohtani', '10', '5'], ['TOTAL', '30', '15']], columns=columns)
df1 = df.sort_values(by='NAME')
print(df1)
the result is
NAME AB H
0 Harper 10 5
2 Ohtani 10 5
3 TOTAL 30 15
1 Trout 10 5
I want to sort the dataframe but keep the 'TOTAL' row out of the sort.
Try the following code to sort df by 'NAME' while excluding 'TOTAL':
df1 = df[df.NAME!='TOTAL'].sort_values(by='NAME')
Output:
NAME AB H
0 Harper 10 5
2 Ohtani 10 5
1 Trout 10 5
You can add the 'TOTAL' row back after sorting. DataFrame.append was removed in pandas 2.0, so use pd.concat:
df1 = pd.concat([df1, df[df.NAME=='TOTAL']])
Output:
NAME AB H
0 Harper 10 5
2 Ohtani 10 5
1 Trout 10 5
3 TOTAL 30 15
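The filter-then-reattach steps can also be folded into a single sort. This is a sketch of an alternative, not the method from the answer: it sorts on a temporary boolean column first, so the 'TOTAL' row always lands last. The column name _is_total is a made-up helper.

```python
import pandas as pd

df = pd.DataFrame(
    [["Harper", "10", "5"], ["Trout", "10", "5"],
     ["Ohtani", "10", "5"], ["TOTAL", "30", "15"]],
    columns=["NAME", "AB", "H"],
)

# False sorts before True, so every non-TOTAL row comes first;
# ties are then ordered by NAME.
df1 = (
    df.assign(_is_total=df["NAME"].eq("TOTAL"))
      .sort_values(by=["_is_total", "NAME"])
      .drop(columns="_is_total")
)
print(df1)
```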

Create python dataframe based on nested loop

I am new to Python pandas, so sorry if this question is very easy.
I have three lists:
A = ['A','B','C']
M = ['1','2','3']
F = ['plus','minus','square']
I want to combine those lists and show the result in a data frame.
I have tried to use list.append:
new_list = []
for i in A:
    new_list.append(i)
    for j in (M):
        new_list.append(j)
print(new_list)
['A', '1', '2', '3', 'B', '1', '2', '3', 'C', '1', '2', '3']
I am confused about how to get the output like this (in dataframe):
It seems you want all possible combinations of the three lists (their Cartesian product, not permutations). You can build them with itertools and pandas. itertools is part of the Python standard library:
import pandas as pd
import itertools
A = ['A','B','C']
M = ['1','2','3']
F = ['plus','minus','square']
df = pd.DataFrame(list(itertools.product(A,M,F)), columns=['A','M','F'])
print(df)
Output:
A M F
0 A 1 plus
1 A 1 minus
2 A 1 square
3 A 2 plus
4 A 2 minus
5 A 2 square
6 A 3 plus
7 A 3 minus
8 A 3 square
9 B 1 plus
10 B 1 minus
11 B 1 square
12 B 2 plus
13 B 2 minus
14 B 2 square
15 B 3 plus
16 B 3 minus
17 B 3 square
18 C 1 plus
19 C 1 minus
20 C 1 square
21 C 2 plus
22 C 2 minus
23 C 2 square
24 C 3 plus
25 C 3 minus
26 C 3 square
What you need is a Cartesian product of the three sets:
import pandas as pd
from itertools import product
pd.DataFrame(list(product(A,M,F)), columns=['A', 'M', 'F'])
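To connect this back to the nested loop in the question, the sketch below builds the same rows with three explicit loops and checks that they match itertools.product. The variable name rows is illustrative.

```python
from itertools import product

A = ['A', 'B', 'C']
M = ['1', '2', '3']
F = ['plus', 'minus', 'square']

# One tuple per combination: the innermost loop varies fastest
rows = []
for a in A:
    for m in M:
        for f in F:
            rows.append((a, m, f))

# product yields the same tuples in the same order
assert rows == list(product(A, M, F))
print(len(rows))  # 27
print(rows[:3])   # [('A', '1', 'plus'), ('A', '1', 'minus'), ('A', '1', 'square')]
```

The key difference from the attempt in the question is that each append adds one complete tuple, which becomes one row of the data frame.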

Count Total number of sequences that meet condition, without for-loop

I have the following Dataframe as input:
l = [2,2,2,5,5,5,3,3,2,2,4,4,6,5,5,3,5]
df = pd.DataFrame(l)
print(df)
0
0 2
1 2
2 2
3 5
4 5
5 5
6 3
7 3
8 2
9 2
10 4
11 4
12 6
13 5
14 5
15 3
16 5
As an output I would like a final count of the sequences (runs of consecutive values) that meet a certain condition. For example, in this case, I want the number of runs in which the values are greater than 3.
So, the output is 3.
1st Sequence = [555]
2nd Sequence = [44655]
3rd Sequence = [5]
Is there a way to calculate this without a for-loop in pandas ?
I have already implemented a solution using a for-loop, and I wonder if there is a better approach using pandas in O(N) time.
Thanks very much!
Related to this question: How to count the number of time intervals that meet a boolean condition within a pandas dataframe?
You can use:
m = df[0] > 3
df[1] = (~m).cumsum()
df = df[m]
print (df)
0 1
3 5 3
4 5 3
5 5 3
10 4 7
11 4 7
12 6 7
13 5 7
14 5 7
16 5 8
# collect each run as a tuple (stored under a new name so df stays available)
s = df.groupby(1)[0].apply(tuple).value_counts()
print (s)
(5, 5, 5) 1
(4, 4, 6, 5, 5) 1
(5,) 1
Name: 0, dtype: int64
# alternatively, join each run into a string
s = df.astype(str).groupby(1)[0].apply(''.join).value_counts()
print (s)
5 1
44655 1
555 1
Name: 0, dtype: int64
If you need the output as a list:
print (df.astype(str).groupby(1)[0].apply(''.join).tolist())
['555', '44655', '5']
Detail:
print (df.astype(str).groupby(1)[0].apply(''.join))
3 555
7 44655
8 5
Name: 0, dtype: object
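The snippets above list the runs, while the question asks for their count. A minimal sketch of two ways to get that number without an explicit loop, reusing the same mask idea (the names count_by_key and count_by_start are illustrative):

```python
import pandas as pd

l = [2, 2, 2, 5, 5, 5, 3, 3, 2, 2, 4, 4, 6, 5, 5, 3, 5]
values = pd.Series(l)
m = values > 3

# Option 1: every run shares one value of (~m).cumsum(),
# so count the distinct keys among the rows that satisfy the condition.
count_by_key = (~m).cumsum()[m].nunique()

# Option 2: a run starts where the condition holds
# and did not hold on the previous row.
previous = m.shift(1, fill_value=False).astype(bool)
count_by_start = int((m & ~previous).sum())

print(count_by_key, count_by_start)  # 3 3
```

Both variants are a fixed number of vectorized passes over the data, so they stay linear in the length of the input.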
If you don't need pandas this will suit your needs:
l = [2,2,2,5,5,5,3,3,2,2,4,4,6,5,5,3,5]
def consecutive(array, value):
    result = []
    sub = []
    for item in array:
        if item > value:
            sub.append(item)
        else:
            if sub:
                result.append(sub)
            sub = []
    if sub:
        result.append(sub)
    return result
print(consecutive(l,3))
#[[5, 5, 5], [4, 4, 6, 5, 5], [5]]
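If only the count is needed, a shorter sketch with itertools.groupby from the standard library gives the same answer. groupby splits the list into consecutive runs that share the same key, here whether the value exceeds 3. It still iterates over the list once internally; run_count is an illustrative name.

```python
from itertools import groupby

l = [2, 2, 2, 5, 5, 5, 3, 3, 2, 2, 4, 4, 6, 5, 5, 3, 5]

# Each group is one consecutive run; count the runs whose key is True
run_count = sum(1 for above, _ in groupby(l, key=lambda x: x > 3) if above)
print(run_count)  # 3
```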
