I'm trying to extract tables from log files which are in .txt format. The file is loaded using read_csv() from pandas.
The log file looks like this:
aaa
bbb
ccc
=====================
A B C D E F
=====================
1 2 3 4 5 6
7 8 9 1 2 3
4 5 6 7 8 9
1 2 3 4 5 6
---------------------
=====================
G H I J
=====================
1 3 4
5 6 7
---------------------
=====================
K L M N O
=====================
1 2 3
4 5 6
7 8 9
---------------------
xxx
yyy
zzz
Here are some points about the log file:
Files start and end with some lines of comment which can be ignored.
In the example above there are three tables.
Headers for each table are located between lines of "======..."
The end of each table is signified by a line of "------..."
My code as of now:
import pandas as pd
import itertools
df = pd.read_csv("xxx.txt", sep="\n", header=None)
# delimiters for header and end-of-table
h_dl = "=" * 21
r_dl = "-" * 21
for i in range(len(df.index)-2):
    # find lines which are table headers & convert the header row to a list
    if (df.iloc[i].any() == h_dl) & (df.iloc[i+2].any() == h_dl):
        h = df.iloc[i+1].str.split().tolist()
        h = list(itertools.chain(*h))
        # while loop to find lines which are table rows & append them to one list
        x = 3
        r = []
        while True:
            if df.iloc[i+x].any() == r_dl:
                break
            r.append(df.iloc[i+x].str.split().tolist())
            x += 1
        r = list(itertools.chain(*r))
        # create pandas dataframe with header and rows obtained above
        t = pd.DataFrame(data=r, columns=h)
This code returns AssertionError: 14 columns passed, passed data had 15 columns. I know this is because, for the table rows, I am using .str.split(), which by default splits on any whitespace. Since some columns have missing values, the number of elements in the table header and the number of elements in a table row do not match for the second and third tables. I am struggling to get around this, since the amount of whitespace that signifies a missing value is different for each table.
My question is: is there a way to account for missing values in some of the columns, so that I can get a DataFrame as output where there are either null or NaN for missing values as appropriate?
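One route worth knowing about: since the tables are fixed-width, pd.read_fwf can infer the column boundaries from the whitespace and return NaN for aligned gaps. A minimal sketch of the idea, assuming a table block (header line plus data rows) has already been sliced out of the log as plain text:
import io
import pandas as pd

# hypothetical block cut out from between the "=====" and "-----" delimiter lines
block = (
    "G   H   I   J\n"
    "1       3   4\n"
    "5       6   7\n"
)
# read_fwf infers fixed-width columns; values missing under a header become NaN
table = pd.read_fwf(io.StringIO(block))
print(table)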
Using Victor Ruiz's method, I added if conditions to handle the different header sizes.
=^..^=
Description in the code comments:
import re
import pandas as pd
import itertools

df = pd.read_csv("stack.txt", sep="\n", header=None)
# delimiters for header and end-of-table
h_dl = "=" * 21
r_dl = "-" * 21
for i in range(len(df.index)-2):
    # find lines which are table headers & convert the header row to a list
    if (df.iloc[i].any() == h_dl) & (df.iloc[i+2].any() == h_dl):
        h = df.iloc[i+1].str.split().tolist()
        h = list(itertools.chain(*h))
        # get header string
        head = df.iloc[i+1].to_string()
        # get space distance in header
        space_range = 0
        for result in re.findall(r'([ ]*)', head):
            if len(result) > 0:
                space_range = len(result)
        x = 3
        r = []
        while True:
            if df.iloc[i+x].any() == r_dl:
                break
            # strip the index prefix that to_string() adds to the line
            line = df.iloc[i+x].to_string()[5::]
            # collect items based on the distance between elements
            items = []
            for result in re.finditer(r'(\d+)([ ]*)', line):
                item, delimiter = result.groups()
                items.append(item)
                if len(delimiter) > space_range*2+1:
                    items.append('NaN')
                    items.append('NaN')
                if len(delimiter) < space_range*2+2 and len(delimiter) > space_range:
                    items.append('NaN')
            r.append([items])
            x += 1
        r = list(itertools.chain(*r))
        # create pandas dataframe with header and rows obtained above
        t = pd.DataFrame(data=r, columns=h)
Output:
A B C D E F
0 1 2 3 4 5 6
1 7 8 9 1 2 3
2 4 5 6 7 8 9
3 1 2 3 4 5 6
G H I J
0 1 NaN 3 4
1 5 NaN 6 7
K L M N O
0 1 NaN NaN 2 3
1 4 5 NaN NaN 6
2 7 8 NaN 9 None
Maybe this can help you.
Suppose we have the following line of text (the wide gap after the 1 is where a value is missing):
1         3    4
The problem is to identify how many spaces delimit two consecutive items, as opposed to how many indicate a missing value between them.
Let's say that up to 5 spaces is a plain delimiter, and more than 5 means there is a missing value.
You can use a regex to parse the items:
from re import finditer
from math import nan  # nan has to come from somewhere, e.g. math or numpy

line = '1         3    4'
items = []
for result in finditer(r'(\d+)([ ]*)', line):
    item, delimiter = result.groups()
    items.append(item)
    if len(delimiter) > 5:
        items.append(nan)
print(items)
Output is:
['1', nan, '3', '4']
A more complex situation arises if two or more consecutive missing values can appear (the code above will inject only one nan).
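One way to handle that case is to treat each full column width of spaces in a gap as one missing value. A sketch, assuming every column occupies a fixed width of 5 characters (the width is an assumption, not something the original tables guarantee):
from math import nan
from re import finditer

line = '1' + ' ' * 14 + '3' + ' ' * 4 + '4'  # hypothetical row with two consecutive missing values
col_width = 5                                # assumed fixed column width

items = []
for match in finditer(r'(\d+)([ ]*)', line):
    item, gap = match.groups()
    items.append(item)
    # a plain delimiter is shorter than one column width; every full
    # column width of spaces stands for one missing value
    items.extend([nan] * (len(gap) // col_width))

print(items)  # ['1', nan, nan, '3', '4']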
Sorry, I should have deleted the old question and created a new one.
I have a dataframe with two columns. The df looks as follows:
Word Tag
0 Asam O
1 instruksi O
2 - O
3 instruksi X
4 bahasa Y
5 Instruksi P
6 - O
7 instruksi O
8 sebuah Q
9 satuan K
10 - L
11 satuan O
12 meja W
13 Tiap Q
14 - O
15 tiap O
16 karakter P
17 - O
18 ke O
19 - O
20 karakter O
and I'd like to merge the rows around each dash - into one row, so the output should be the following:
Word Tag
0 Asam O
1 instruksi-instruksi O
2 bahasa Y
3 Instruksi-instruksi P
4 sebuah Q
5 satuan-satuan K
6 meja W
7 Tiap-tiap Q
8 karakter-ke-karakter P
Any ideas? Thanks in advance. I have tried the answer from Jacob K; it works, but I then found that in my dataset there can be more than one - row in between. I have included that case in the expected output, at index 8.
Solution from Jacob K:
# Import packages
import pandas as pd
import numpy as np

# Get 'Word' and 'Tag' columns as numpy arrays (for easy indexing)
words = df.Word.to_numpy()
tags = df.Tag.to_numpy()
# Create empty lists for new columns in output dataframe
newWords = []
newTags = []
# Use while (rather than a for loop) since index i can change dynamically
i = 0  # To not cause any issues with the i-1 index
while (i < words.shape[0] - 1):
    if (words[i] == "-"):
        # Concatenate the strings above and below the "-"
        newWords.append(words[i-1] + "-" + words[i+1])
        newTags.append(tags[i-1])
        i += 2  # Don't repeat any concatenated values
    else:
        if (words[i+1] != "-"):
            # If there is no "-" next, append the regular word and tag values
            newWords.append(words[i])
            newTags.append(tags[i])
        i += 1  # Increment normally
# Create output dataframe output_df
d2 = {'Word': newWords, 'Tag': newTags}
output_df = pd.DataFrame(data=d2)
My approach with GroupBy.agg:
#df['Word'] = df['Word'].str.replace(' ', '') #if necessary
blocks = df['Word'].shift().ne('-').mul(df['Word'].ne('-')).cumsum()
new_df = df.groupby(blocks, as_index=False).agg({'Word' : ''.join, 'Tag' : 'first'})
print(new_df)
Output
Word Tag
0 Asam O
1 instruksi-instruksi O
2 bahasa Y
3 Instruksi-instruksi P
4 sebuah Q
5 satuan-satuan K
6 meja W
7 Tiap-tiap Q
8 karakter-ke-karakter P
Blocks (Detail)
print(blocks)
0 1
1 2
2 2
3 2
4 3
5 4
6 4
7 4
8 5
9 6
10 6
11 6
12 7
13 8
14 8
15 8
16 9
17 9
18 9
19 9
20 9
Name: Word, dtype: int64
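To see why blocks groups a word, its dash, and the following word under a single id, here is a tiny standalone demo of the same expression:
import pandas as pd

s = pd.Series(['a', '-', 'b', 'c'])
# a new block starts only where neither the current nor the previous value is '-'
blocks = s.shift().ne('-').mul(s.ne('-')).cumsum()
print(blocks.tolist())  # [1, 1, 1, 2]: 'a', '-', 'b' share block 1, 'c' starts block 2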
This is a loop version:
import pandas as pd

# import data
DF = pd.read_csv("table.csv")
# create a new DF
newDF = pd.DataFrame()
# iterate through rows
for i in range(len(DF)-1):
    # prepare prev row index (handles the edge case of the first row)
    prev = i-1
    if (prev < 0):
        prev = 0
    # copy the row if neither it, the previous row, nor the next row is '-'
    if (DF.loc[i+1, 'Word'] != '-'):
        if (DF.loc[i, 'Word'] != '-' and DF.loc[prev, 'Word'] != '-'):
            newDF = newDF.append(DF.loc[i, :])
    # unite the three rows if the middle one is '-'
    else:
        row = {'Tag': [DF.loc[i, 'Tag']],
               'Word': [DF.loc[i, 'Word'] + DF.loc[i+1, 'Word'] + DF.loc[i+2, 'Word']]}
        newDF = newDF.append(pd.DataFrame(row))
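A caveat on this one: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the rows are better collected in a plain list and the frame built once after the loop. A minimal sketch of that pattern:
import pandas as pd

rows = []  # collect plain dicts inside the loop instead of appending frames
rows.append({'Word': 'Asam', 'Tag': 'O'})
rows.append({'Word': 'instruksi-instruksi', 'Tag': 'O'})
newDF = pd.DataFrame(rows)  # a single construction call after the loop
print(newDF)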
I am trying to display each character together with the number of times it appears consecutively.
Input Specification
The first line of input contains the number N, which is the number of lines that follow. The next
N lines will contain at least one and at most 80 characters, none of which are spaces.
Output Specification
Output will be N lines. Line i of the output will be the encoding of the line i + 1 of the input.
The encoding of a line will be a sequence of pairs, separated by a space, where each pair is an
integer (representing the number of times the character appears consecutively) followed by a space,
followed by the character.
Sample Input
4
+++===!!!!
777777......TTTTTTTTTTTT
(AABBC)
3.1415555
Output for Sample Input
3 + 3 = 4 !
6 7 6 . 12 T
1 ( 2 A 2 B 1 C 1 )
1 3 1 . 1 1 1 4 1 1 4 5
Just use itertools.groupby and format the result: the value and the length of each group, then join the pieces:
import itertools

s = "+++===!!!! 777777......TTTTTTTTTTTT (AABBC) 3.1415555"
result = " ".join("{} {}".format(sum(1 for _ in group), value)
                  for value, group in itertools.groupby(s))
result:
3 + 3 = 4 ! 1 6 7 6 . 12 T 1 1 ( 2 A 2 B 1 C 1 ) 1 1 3 1 . 1 1 1 4 1 1 4 5
Without a key parameter, itertools.groupby simply groups identical consecutive items, preserving order; you then just count them. Here I chose not to build a list just to consume the group (len(list(group))) but to use sum(1 for _ in group) instead.
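Applied line by line, as the problem statement asks, the same idea reproduces the sample output exactly:
import itertools

lines = ["+++===!!!!", "777777......TTTTTTTTTTTT", "(AABBC)", "3.1415555"]
for line in lines:
    print(" ".join("{} {}".format(sum(1 for _ in group), value)
                   for value, group in itertools.groupby(line)))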
I'd do something like this:
s = "+++===!!!! 777777......TTTTTTTTTTTT (AABBC) 3.1415555"
d = {char: 0 for char in s}
for char in s:
    d[char] += 1
output = "".join([" {} {}".format(value, key) for key, value in d.items()])
# outputs: '3 + 3 = 4 ! 3 6 7 7 . 1 2 T 1 ( 2 A 2 B 1 C 1 ) 1 3 2 1 1 4 4 5'
Since it looks like you aren't looking for total repeating characters, I would suggest reading the string backward and for each character, you want to count how many times it appears as you're iterating through, and once you hit a different character you use the current count for the output. In fact, you could generate your output as you iterate through the string backward.
It might look something like this:
s = "3.1415555"      # example input line
reverse = s[::-1]    # reverse the string (a slice like [-1:0] would yield an empty string)
output = ''
count = 0
letter = reverse[0]
for ch in reverse:
    if ch == letter:
        count += 1
    else:
        # a new run starts: prepend the finished run, so the output ends up in original order
        output = str(count) + ' ' + letter + ' ' + output
        letter = ch
        count = 1
output = (str(count) + ' ' + letter + ' ' + output).strip()  # flush the final run
print(output)        # 1 3 1 . 1 1 1 4 1 1 4 5
I have a list of numbers in a pandas DataFrame and want to group these numbers by a specific range and count them. The numbers range from 0 to 20, but some numbers may be absent, say there is no 6; in that case I want the count shown as 0.
The dataframe column looks like:
|points|
5
1
7
3
2
2
1
18
15
4
5
I want the output to look like the following:
range | count
1 2
2 2
3 1
4 1
5 2
6 0
7 ...
8
9...
I would iterate through the input lines and fill up a dict with the values.
All you have to do then is count...
import collections

# read your input and store the numbers in a list
lines = []
with open('input.txt') as f:
    lines = [int(line.rstrip()) for line in f]
# pre-fill the dictionary with 0s from 0 up to the highest number in your input
values = {}
for i in range(max(lines)+1):
    values[i] = 0
# increment the occurrence count by 1 for each value found
for val in lines:
    values[val] += 1
# order the dict
values = collections.OrderedDict(sorted(values.items()))
print("range\t|\tcount")
for k in values:
    print(str(k) + "\t\t\t" + str(values[k]))
repl: https://repl.it/repls/DesertedDeafeningCgibin
Edit:
A slightly more elegant version using a dict comprehension:
# read input as in the first example
values = {i: 0 for i in range(max(lines)+1)}
for val in lines:
    values[val] += 1
# order and print as in the first example
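Since the numbers already live in a DataFrame, a pandas-only variant of the same idea is value_counts plus reindex, which fills the absent numbers with 0. A sketch using the sample column from the question:
import pandas as pd

df = pd.DataFrame({'points': [5, 1, 7, 3, 2, 2, 1, 18, 15, 4, 5]})
counts = (df['points'].value_counts()
          .reindex(range(df['points'].max() + 1), fill_value=0))
print(counts)  # index runs 0..18, values are the counts, 0 where a number is absent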
I have a dataframe with two columns. I want to know how many characters they have in common; the number of common elements should become a new column. Here's a minimal reproducible example.
What I have:
import pandas as pd
from string import ascii_lowercase
import numpy as np
df = pd.DataFrame([[''.join(np.random.choice(list(ascii_lowercase),
8)) for i in range(10)] for i in range(2)],
index=['col_1', 'col_2']).T
Out[17]:
col_1 col_2
0 ollcgfmy daeubsrx
1 jtvtqoux xbgtrzno
2 irwmoqqa mdblczfa
3 jyebzpyd xwlynkhw
4 ifuqojvs lxotbsju
5 fybsqbku xwbluaek
6 oylztnpf gelonsay
7 zdkibutk ujlcwhfu
8 uhrcjbsk nhxhpoii
9 eocxreqz muvfwusi
What I need (the numbers are random):
Out[19]:
col_1 col_2 common_letters
0 ollcgfmy daeubsrx 1
1 jtvtqoux xbgtrzno 1
2 irwmoqqa mdblczfa 0
3 jyebzpyd xwlynkhw 3
4 ifuqojvs lxotbsju 3
5 fybsqbku xwbluaek 3
6 oylztnpf gelonsay 3
7 zdkibutk ujlcwhfu 3
8 uhrcjbsk nhxhpoii 1
9 eocxreqz muvfwusi 3
EDIT: to anyone reading this who is trying to measure similarity between two strings: don't use this approach. Better similarity measures exist, such as Levenshtein or Jaccard.
Using df.apply and set operations can be one way to solve the problem:
df["common_letters"] = df.apply(
lambda x: len(set(x["col_1"]).intersection(set(x["col_2"]))),
axis=1)
output:
col_1 col_2 common_letters
0 cgeabfem amnwfsde 4
1 vozgpmgs slfwvjnv 2
2 xyvktrfr jtzijmud 1
3 piexmmgh ydaxbmyo 2
4 iydpnwcu hhdxyptd 3
If you like sets you can use inclusion-exclusion, |A| + |B| - |A ∪ B| = |A ∩ B|:
df['common_letters'] = (df.col_1.apply(set).apply(len)
+ df.col_2.apply(set).apply(len)
- (df.col_1+df.col_2).apply(set).apply(len))
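The identity behind this, checked on the first row of the example frame:
# |A| + |B| - |A union B| equals |A intersect B|
a, b = set('ollcgfmy'), set('daeubsrx')
print(len(a) + len(b) - len(a | b), len(a & b))  # both print 0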
You can use numpy: np.bitwise_and applies Python's & elementwise on object Series, and & on two sets is their intersection:
df["noCommonChars"] = np.bitwise_and(df["col_1"].map(set), df["col_2"].map(set)).str.len()
Output:
col_1 col_2 noCommonChars
0 smuxucyw hywtedvz 2
1 bniuqhkh axcuukjg 2
2 ttzehrtl nbmsmwsc 0
3 ndwyjusu dssmdnvb 3
4 zqvsvych wguthcwu 2
5 jlnpjqgn xgedmodm 1
6 ocjbtnpy lywjqkjf 2
7 tolrpshi hslxxmgo 4
8 ehatmryw fhpvluvq 1
9 icciebte joyiwooi 1
Edit
In order to include repeating characters - you can do:
from collections import Counter
df["common_letters_full"]=np.bitwise_and(df["col_1"].map(Counter), df["col_2"].map(Counter))
df["common_letters"]=df["common_letters_full"].map(dict.values).apply(sum)
#alternatively:
df["common_letters"]=df["common_letters_full"].apply(pd.Series).sum(axis=1)
I already see better answers :D, but here goes a decent one. I might have been able to use more from pandas:
I took some code from here
import pandas as pd
from string import ascii_lowercase
import numpy as np
def countPairs(s1, s2):
    n1 = len(s1)
    n2 = len(s2)
    # To store the frequencies of characters
    # of strings s1 and s2
    freq1 = [0] * 26
    freq2 = [0] * 26
    # To store the count of valid pairs
    count = 0
    # Update the frequencies of
    # the characters of string s1
    for i in range(n1):
        freq1[ord(s1[i]) - ord('a')] += 1
    # Update the frequencies of
    # the characters of string s2
    for i in range(n2):
        freq2[ord(s2[i]) - ord('a')] += 1
    # Find the count of valid pairs
    for i in range(26):
        count += min(freq1[i], freq2[i])
    return count
# This code is contributed by Ryuga
df = pd.DataFrame([[''.join(np.random.choice(list(ascii_lowercase),
8)) for i in range(10)] for i in range(2)],
index=['col_1', 'col_2']).T
counts = []
for i in range(0, df.shape[0]):
    counts.append(countPairs(df.iloc[i].col_1, df.iloc[i].col_2))
df["counts"] = counts
col_1 col_2 counts
0 ploatffk dwenjpmc 1
1 gjjupyqg smqtlmzc 1
2 cgtxexho hvwhpyfh 1
3 mifsbfhc ufalhlbi 4
4 qnjesfdn lyhrrnkf 2
5 omnumzmf dagttzqo 2
6 gsygkrrb aocfoqxk 1
7 wrgvruuw ydnlzvyf 1
8 ivkdxoft zmgcnrjr 0
9 vvthbzjj mmirlcvx 1
With lengthy column names, DataFrames will display in a very messy form seemingly no matter what options are set.
Info: I'm in Jupyter QtConsole, pandas 0.20.1, with the following relevant options specified at startup:
pd.set_option('display.max_colwidth', 20)
pd.set_option('expand_frame_repr', False)
pd.set_option('display.max_rows', 25)
Question: how can I truncate the DataFrame if necessary rather than wrapping the columns to the next line, while keeping expand_frame_repr=False?
Here's an example. Again, the issue doesn't depend on the number of columns but on the length of the column names.
This will not cause an issue:
df = pd.DataFrame(np.random.randn(1000, 1000),
columns=['col' + str(i) for i in range(1000)])
As the output is perfectly readable (screenshot omitted).
The same DataFrame with long column names causes the issue I'm talking about:
df = pd.DataFrame(np.random.randn(1000, 1000),
columns=['very_long_col_name_'
+ str(i) for i in range(1000)])
Is there any way to conform the second output to be like the first that I'm missing? (Through specifying an option, not through using .iloc every time I want to view.)
Use max_columns
from string import ascii_letters
df = pd.DataFrame(np.random.randint(10, size=(5, 52)), columns=list(ascii_letters))
with pd.option_context(
        'display.max_colwidth', 20,
        'expand_frame_repr', False,
        'display.max_rows', 25,
        'display.max_columns', 5,
):
    print(df.add_prefix('really_long_column_name_'))
really_long_column_name_a really_long_column_name_b ... really_long_column_name_Y really_long_column_name_Z
0 8 1 ... 1 9
1 8 5 ... 2 1
2 5 0 ... 9 9
3 6 8 ... 0 9
4 1 2 ... 7 1
[5 rows x 52 columns]
Another idea... Obviously not exactly what you want, but maybe you can twist it to your needs.
d1 = df.add_suffix('_really_long_column_name')
with pd.option_context('display.max_colwidth', 4, 'expand_frame_repr', False):
    mw = pd.get_option('display.max_colwidth')
    print(d1.rename(columns=lambda x: x[:mw-3] + '...' if len(x) > mw else x))
a... b... c... d... e... f... g... h... i... j... ... Q... R... S... T... U... V... W... X... Y... Z...
0 6 5 5 5 8 3 5 0 7 6 ... 9 0 6 9 6 8 4 0 6 7
1 0 5 4 7 2 5 4 3 8 7 ... 8 1 5 3 5 9 4 5 5 3
2 7 2 1 6 5 1 0 1 3 1 ... 6 7 0 9 9 5 2 8 2 2
3 1 8 7 1 4 5 5 8 8 3 ... 3 6 5 7 1 0 8 1 4 0
4 7 5 6 2 4 9 7 9 0 5 ... 6 8 1 6 3 5 4 2 3 2
Looks like it will need an enhancement. The relevant code in the repr function appears to be here:
max_rows = get_option("display.max_rows")
max_cols = get_option("display.max_columns")
show_dimensions = get_option("display.show_dimensions")
if get_option("display.expand_frame_repr"):
    width, _ = console.get_console_size()
else:
    width = None
self.to_string(buf=buf, max_rows=max_rows, max_cols=max_cols,
               line_width=width, show_dimensions=show_dimensions)
So either you pass expand_frame_repr=True and it wraps on the line width, or you pass expand_frame_repr=False and it shouldn't. But it looks like there is a bug in the code (this should be pandas 0.20.3 iirc):
in pd.io.formats.format.DataFrameFormatter:
def _chk_truncate(self):
    """
    Checks whether the frame should be truncated. If so, slices
    the frame up.
    """
    from pandas.core.reshape.concat import concat

    # Column of which first element is used to determine width of a dot col
    self.tr_size_col = -1

    # Cut the data to the information actually printed
    max_cols = self.max_cols
    max_rows = self.max_rows

    if max_cols == 0 or max_rows == 0:  # assume we are in the terminal
                                        # (why else = 0)
        (w, h) = get_terminal_size()
        self.w = w
        self.h = h
        if self.max_rows == 0:
            dot_row = 1
            prompt_row = 1
            if self.show_dimensions:
                show_dimension_rows = 3
            n_add_rows = (self.header + dot_row + show_dimension_rows +
                          prompt_row)
            # rows available to fill with actual data
            max_rows_adj = self.h - n_add_rows
            self.max_rows_adj = max_rows_adj

        # Format only rows and columns that could potentially fit the
        # screen
        if max_cols == 0 and len(self.frame.columns) > w:
            max_cols = w
        if max_rows == 0 and len(self.frame) > h:
            max_rows = h
Looks like it intended to do what you wanted, but was unfinished. It's checking max_cols against the number of columns, not the total width of the columns.
So you could either create a show_df function that would calculate the correct number of columns and show it in an option_context like pi2Squared's answer, or fix it here (and maybe submit a patch if you need it distributed).
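For the first route, a show_df helper might look like this. This is a rough sketch with a hypothetical name: it estimates how many columns fit the terminal from their rendered widths and sets display.max_columns accordingly:
import pandas as pd
from shutil import get_terminal_size

def show_df(df):
    term_width = get_terminal_size().columns
    # rendered width per column: the wider of header and values, plus a separator
    widths = [max(len(str(col)), int(df[col].astype(str).str.len().max())) + 2
              for col in df.columns]
    n_fit, used = 0, 0
    for w in widths:
        if used + w > term_width:
            break
        used += w
        n_fit += 1
    with pd.option_context('display.max_columns', max(n_fit, 2),
                           'expand_frame_repr', False):
        print(df)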
As others have pointed out, Pandas itself seems to be bugged or badly designed here, so a workaround is required.
Most of the time this problem occurs with numerical columns, since numbers are relatively short. Pandas will split the column heading onto multiple lines if there are spaces in it, so you can "hack in" the correct behavior by inserting spaces into the headings of numerical columns when you display the dataframe. I have a small helper to do this:
def colfix(df, L=5):
    return df.rename(columns=lambda x: ' '.join(x.replace('_', ' ')[i:i+L]
                                                for i in range(0, len(x), L))
                     if df[x].dtype in ['float64', 'int64'] else x)
To display your dataframe, simply type
colfix(your_df)
Note that the renaming does not permanently change the dataframe; it only adds spaces to the names for the purposes of displaying it that one time.
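For example, applying it to a hypothetical frame with long numeric column names:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 2),
                  columns=['very_long_col_name_0', 'very_long_col_name_1'])
print(colfix(df))
# the headers now contain a space every 5 characters, so pandas can
# wrap them over several lines instead of widening the table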
Results (in a Jupyter Notebook): with colfix, the long numeric headers wrap onto several lines; without it, they do not (screenshots omitted).