Python/Pandas: CSV Data Cleanup For Migration

I'm a fairly new Python/pandas user, and I have been tasked with cleaning up a roughly 5,000-row CSV of records and then migrating the records into a SQL database.
The contents are individual people's personal information (which prevents me from posting it for reference) and their 'seat' occupation information, but the file has been... mismanaged... over the years, and has ended up looking like this:
#Sect1   Sect2            Sect3    Seat#
L/L/L/L  320/320/319/321  D/C/D/C  1-2/1-2/1-2/1-2
V        602              -        1-6
T        101              F        1&3
R        158              -        3* 4
U        818              4        Ds9R
That individual's personal information sits in four more columns to the left, not shown here.
In reality, even just the top row from the selection above should actually be:
#Sect1  Sect2  Sect3  Seat#
L       320    D      1
L       320    D      2
L       320    C      1
L       320    C      2
L       319    D      1
L       319    D      2
L       321    C      1
L       321    C      2
with the '-'s meaning 'through', not 'and'. (For example, the second row in my original example would be Seat# 1 through Seat# 6, not Seat# 1 and 6.)
I should also note that there's no unique ID/Index for these individuals, and it's purely based on First/Last name.
I've been attempting to break some of this up, and have had limited success with
df1 = df1.drop('Sect2', axis=1).join(df1['Sect2'].str.split('/', expand=True).stack().reset_index(level=1, drop=True).rename('Sect2'))
but this eventually ends up creating erroneous records, since splitting one column at a time pairs each split value with every value of the other columns and loses the positional alignment, such as
#Sect1  Sect2  Sect3  Seat#
L       319    C      1
In the end, my question is: is using a script to clean this data even possible? I'm rapidly running out of ideas, and really don't want to have to do this manually, but I also don't want to waste any more time trying to script this out if it's a pointless endeavor.

The code below should address the two issues described in your post. It should be selective enough to avoid misinterpreting rows, but some manual curation will likely still be necessary.
The basic concept is to iterate row by row, processing as much as possible before proceeding. The first priority is to split rows containing the character "/". If none is found, a range value ("-") is interpreted. The while loop permits gradual improvement: for example, the code converts 1-3 into 1/2/3, then re-reads the same row and splits it into three different rows.
import re
import pandas as pd

# build demo dataframe
d = {"Sect1": ["L/L/L/L", "V", "T", "R", "U"],
     "Sect2": ["320/320/319/321", "602", "101", "158", "818"],
     "Sect3": ["D/C/D/C", "-", "F", "-", "4"],
     "Seat#": ["1-2/1-2/1-2/1-2", "1-6", "1&3", "3* 4", "Ds9R"]}
df = pd.DataFrame(data=d)

index = 0
while index < len(df):
    len_df = len(df)
    row_li = [df.iloc[index][x] for x in df.columns]
    # extract separated values
    sep_li = [x.split("/") for x in row_li]
    sep_min, sep_max = len(min(sep_li, key=len)), len(max(sep_li, key=len))
    # extract range values ("^\d+-\d+$" matches e.g. "1-6"; "|$" yields "" otherwise)
    num_range_li = [re.findall(r"^\d+-\d+$|$", x)[0].split("-") for x in row_li]
    num_range_max = len(max(num_range_li, key=len))
    # create temporary dictionary representing the current row
    r = {head: row_li[i] for i, head in enumerate(df.columns)}
    # separated-values treatment -> split into distinct rows
    if sep_min > 1 and sep_min == sep_max:
        for i, head in enumerate(df.columns):
            r[head] = sep_li[i]
        row_df = pd.DataFrame(data=r)
        df = pd.concat([df, row_df], ignore_index=True)
    # range-values treatment -> convert into separated values
    elif num_range_max > 1:
        for part in (1, 2):
            for idx, header in enumerate(df.columns):
                if len(num_range_li[idx]) > 1:
                    split_li = [str(x) for x in range(int(num_range_li[idx][0]),
                                                      int(num_range_li[idx][1]) + 1)]
                    # convert range values to separated values
                    if part == 1:
                        r[header] = "/".join(split_li)
                    # multiply the other values to match
                    else:
                        for i, head in enumerate(df.columns):
                            if i != idx:
                                r[head] = "/".join([r[head] for _ in range(len(split_li))])
        row_df = pd.DataFrame(data=r, index=[0])
        df = pd.concat([df, row_df], ignore_index=True)
    # if no new rows were added, increment
    if len(df) == len_df:
        index += 1
    # if rows were added, drop the current (now expanded) row
    else:
        df = pd.concat([df.iloc[:index], df.iloc[index + 1:]])
print(df)
Output:
   Sect1 Sect2 Sect3 Seat#
0      T   101     F   1&3
1      R   158     -  3* 4
2      U   818     4  Ds9R
4      V   602     -     1
5      V   602     -     2
6      V   602     -     3
7      V   602     -     4
8      V   602     -     5
9      V   602     -     6
10     L   320     D     1
11     L   320     D     2
12     L   320     C     1
13     L   320     C     2
14     L   319     D     1
15     L   319     D     2
16     L   321     C     1
17     L   321     C     2
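For what it's worth, on pandas 1.3+ the "/"-splitting step alone can also be done without the manual loop via a multi-column explode. A minimal sketch on the original demo frame, assuming every column in a row splits into the same number of parts (it handles only the "/" step, not the "-" ranges):

cols = list(df.columns)  # ["Sect1", "Sect2", "Sect3", "Seat#"]
# split every cell on "/" and explode all columns in lockstep
exploded = (df.apply(lambda s: s.str.split("/"))
              .explode(cols, ignore_index=True))
print(exploded)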

Related

Create a new data frame column based on another data frame column values

I have two data frames df1 and df2
     no  plan  current flag
0  abc1   249       30  Y/U
1  abc2   249       30    N
2  abc3   249       30  Y/D
and
   plan  offer
0   149     20
1   249     30
2   349     40
I want to add an extra column to df1 such that if df1['flag'] == 'Y/U', it looks up the next higher number in df2['offer'] relative to df1['current']. Similarly, the same rule applies for the next lower number where the flag is 'Y/D'. (Keep the value as usual if the flag is 'N'.)
Expected output:
     no  plan  current flag  Pos
0  abc1   249       30  Y/U   40
1  abc2   249       30    N   30
2  abc3   249       30  Y/D   20
I tried to do it using apply.
df1['pos'] = (df1.apply(lambda x: next((z for (y, z) in zip(df2['plan'], df2['offer'])
if y > x['plan'] if z > x['current']), None), axis=1))
But it gives the result as if every case were 'Y/U'.
You can achieve the desired result without using plan at all; a plain list is enough.
offers = df2['offer'].sort_values().tolist()

def assign_pos(row, offers):
    index = offers.index(row['current'])
    if row['flag'] == "N":
        row['pos'] = row['current']
    elif row['flag'] == 'Y/U':
        row['pos'] = offers[index + 1]
    elif row['flag'] == 'Y/D':
        row['pos'] = offers[index - 1]
    return row

df1 = df1.apply(assign_pos, args=[offers], axis=1)
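A quick self-contained check with the frames from the question (a hedged demo; names follow the post):

import pandas as pd

df1 = pd.DataFrame({'no': ['abc1', 'abc2', 'abc3'],
                    'plan': [249, 249, 249],
                    'current': [30, 30, 30],
                    'flag': ['Y/U', 'N', 'Y/D']})
df2 = pd.DataFrame({'plan': [149, 249, 349],
                    'offer': [20, 30, 40]})

offers = df2['offer'].sort_values().tolist()
df1 = df1.apply(assign_pos, args=[offers], axis=1)
print(df1)  # the pos column comes out 40, 30, 20, matching the expected output

Two caveats worth knowing: offers[index - 1] wraps to the end of the list when current is already the lowest offer, and offers.index(...) raises ValueError if current does not appear in df2['offer'] at all.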

Optimising finding row pairs in a dataframe

I have a dataframe with rows that describe a movement of value between nodes in a system. This dataframe looks like this:
index  from_node  to_node  value  invoice_number
0      A          E        10     a
1      B          F        20     a
2      C          G        40     c
3      D          H        60     d
4      E          I        35     c
5      X          D        43     d
6      Y          F        50     d
7      E          H        70     a
8      B          A        55     b
9      X          B        33     a
I am looking to find "swaps" in the invoice history. A swap is defined where a node both receives and sends a value to a different node within the same invoice number. In the above dataset there are two swaps in invoice "a", and one swap in invoice "d" ("sent to" and "received from" could be the same node in the same row):
index  node  sent_to  sent_value  received_from  received_value  invoice_number
0      B     F        20          X              33              a
1      E     H        70          A              10              a
2      D     H        60          X              43              d
I have solved this problem by iterating over all of the unique invoice numbers in the dataset and then iterating over each row within that invoice number to find pairs:
import pandas as pd

df = pd.DataFrame({
    'from_node': ['A','B','C','D','E','X','Y','E','B','X'],
    'to_node': ['E','F','G','H','I','D','F','H','A','B'],
    'value': [10,20,40,60,35,43,50,70,55,33],
    'invoice_number': ['a','a','c','d','c','d','d','a','b','a'],
})

invoices = set(df.invoice_number)
list_df_swap = []
for invoice in invoices:
    df_inv = df[df.invoice_number.isin([invoice])]
    for r in df_inv.itertuples():
        df_is_swap = df_inv[df_inv.to_node.isin([r.from_node])]
        if len(df_is_swap.index) == 1:
            swap = {'node': r.from_node,
                    'sent_to': r.to_node,
                    'sent_value': r.value,
                    'received_from': df_is_swap.iloc[0]['from_node'],
                    'received_value': df_is_swap.iloc[0]['value'],
                    'invoice_number': r.invoice_number}
            list_df_swap.append(pd.DataFrame(swap, index=[0]))

df_swap = pd.concat(list_df_swap, ignore_index=True)
The total dataset consists of several hundred million rows, and this approach is not very efficient. Is there a way to solve this problem using some kind of vectorised solution, or another method that would speed up the execution time?
Calculate all possible swaps, regardless of the invoice number:
swaps = df.merge(df, left_on='from_node', right_on='to_node')
Then select those that have the same invoice number:
columns = ['from_node_x', 'to_node_x', 'value_x', 'from_node_y', 'value_y',
'invoice_number_x']
swaps[swaps.invoice_number_x == swaps.invoice_number_y][columns]
#  from_node_x to_node_x  value_x from_node_y  value_y invoice_number_x
#1           B         F       20           X       33                a
#3           D         H       60           X       43                d
#5           E         H       70           A       10                a
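A hedged variant of the same idea for the full-size data: making invoice_number part of the join keys keeps the merge from ever materialising cross-invoice pairs, so no post-filter is needed (the _snd/_rcv suffixes are my naming):

swaps = df.merge(df,
                 left_on=['from_node', 'invoice_number'],
                 right_on=['to_node', 'invoice_number'],
                 suffixes=('_snd', '_rcv'))
print(swaps)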

How to parse tables from .txt files using Pandas

I'm trying to extract tables from log files which are in .txt format. The file is loaded using read_csv() from pandas.
The log file looks like this:
aaa
bbb
ccc
=====================
A B C D E F
=====================
1 2 3 4 5 6
7 8 9 1 2 3
4 5 6 7 8 9
1 2 3 4 5 6
---------------------
=====================
G H I J
=====================
1 3 4
5 6 7
---------------------
=====================
K L M N O
=====================
1 2 3
4 5 6
7 8 9
---------------------
xxx
yyy
zzz
Here are some points about the log file:
Files start and end with some lines of comment which can be ignored.
In the example above there are three tables.
Headers for each table are located between lines of "======..."
The end of each table is signified by a line of "------..."
My code as of now:
import pandas as pd
import itertools

df = pd.read_csv("xxx.txt", sep="\n", header=None)

# delimiters for header and end-of-table
h_dl = "=" * 21
r_dl = "-" * 21

for i in range(len(df.index)-2):
    # if block to find lines which are table headers & convert to list
    if (df.iloc[i].any() == h_dl) & (df.iloc[i+2].any() == h_dl):
        h = df.iloc[i+1].str.split().tolist()
        h = list(itertools.chain(*h))
        # while loop to find lines which are table rows & append to one list
        x = 3
        r = []
        while True:
            if df.iloc[i+x].any() == r_dl:
                break
            r.append(df.iloc[i+x].str.split().tolist())
            x += 1
        r = list(itertools.chain(*r))
        # create pandas dataframe with header and rows obtained above
        t = pd.DataFrame(data=r, columns=h)
This code returns AssertionError: 14 columns passed, passed data had 15 columns. I know this is because, for the table rows, I am using .str.split(), which by default splits on whitespace. Since some columns have missing values, the number of elements in the table headers and the number of elements in the table rows do not match for the second and third tables. I am struggling to get around this, since the number of whitespace characters that signifies a missing value is different for each table.
My question is: is there a way to account for missing values in some of the columns, so that I can get a DataFrame as output where there are either null or NaN for missing values as appropriate?
Using Victor Ruiz's method (see his answer below), I added conditions to handle different header sizes.
Description in code:
import re
import itertools
import pandas as pd

df = pd.read_csv("stack.txt", sep="\n", header=None)

# delimiters for header and end-of-table
h_dl = "=" * 21
r_dl = "-" * 21

for i in range(len(df.index)-2):
    # if block to find lines which are table headers & convert to list
    if (df.iloc[i].any() == h_dl) & (df.iloc[i+2].any() == h_dl):
        h = df.iloc[i+1].str.split().tolist()
        h = list(itertools.chain(*h))
        # get header string
        head = df.iloc[i+1].to_string()
        # get space distance in header
        space_range = 0
        for result in re.findall('([ ]*)', head):
            if len(result) > 0:
                space_range = len(result)
        x = 3
        r = []
        while True:
            if df.iloc[i+x].any() == r_dl:
                break
            # strip line
            line = df.iloc[i+x].to_string()[5::]
            # collect items based on element distance
            items = []
            for result in re.finditer(r'(\d+)([ ]*)', line):
                item, delimiter = result.groups()
                items.append(item)
                if len(delimiter) > space_range*2+1:
                    items.append('NaN')
                    items.append('NaN')
                if len(delimiter) < space_range*2+2 and len(delimiter) > space_range:
                    items.append('NaN')
            r.append([items])
            x += 1
        r = list(itertools.chain(*r))
        # create pandas dataframe with header and rows obtained above
        t = pd.DataFrame(data=r, columns=h)
Output:
   A  B  C  D  E  F
0  1  2  3  4  5  6
1  7  8  9  1  2  3
2  4  5  6  7  8  9
3  1  2  3  4  5  6

   G    H  I  J
0  1  NaN  3  4
1  5  NaN  6  7

   K    L    M    N     O
0  1  NaN  NaN    2     3
1  4    5  NaN  NaN     6
2  7    8  NaN    9  None
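As an aside, fixed-width tables like these are also what pandas.read_fwf was built for; a hedged sketch (parse_block is a hypothetical helper, and it assumes you have already isolated one header line plus its row lines, with the ===/--- delimiter lines stripped):

import io
import pandas as pd

def parse_block(header_line, row_lines):
    # join the block back into one text blob and let read_fwf infer the
    # column boundaries from the whitespace
    block = "\n".join([header_line] + row_lines)
    return pd.read_fwf(io.StringIO(block))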
Maybe this can help you.
Suppose we have the following line of text:
1        3    4
The problem is to identify how many spaces delimit two consecutive items, as opposed to how many indicate a missing value between them.
Let's consider that up to 5 spaces is a delimiter, and more than 5 indicates a missing value.
You can use regex to parse the items:
from re import finditer
from numpy import nan

line = '1        3    4'
items = []
for result in finditer(r'(\d+)([ ]*)', line):
    item, delimiter = result.groups()
    items.append(item)
    if len(delimiter) > 5:
        items.append(nan)
print(items)
Output is:
['1', nan, '3', '4']
A more complex situation would be if two or more consecutive missing values can appear (the code above will inject only one nan).
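One hedged way to cover that case, staying with the "about five spaces per column slot" assumption from above: derive the number of missing slots from the width of the gap instead of appending a single nan (SLOT is my assumed slot width):

from re import finditer
from numpy import nan

SLOT = 6  # assumed width of one column slot (an item plus its padding)

def parse_line(line, slot=SLOT):
    items = []
    for result in finditer(r'(\d+)([ ]*)', line):
        item, gap = result.groups()
        items.append(item)
        # every extra full slot of trailing spaces marks one missing value
        items.extend([nan] * (len(gap) // slot))
    return items

print(parse_line('1        3    4'))  # -> ['1', nan, '3', '4']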

Pandas track consecutive near numbers via compare-cumsum-groupby pattern

I am trying to extend my current pattern to accommodate an extra condition: values within +- a percentage of the last value, rather than a strict match with the previous value.
import numpy as np
import pandas as pd

data = np.array([[2,30],[2,900],[2,30],[2,30],[2,30],[2,1560],[2,30],
                 [2,300],[2,30],[2,450]])
df = pd.DataFrame(data)
df.columns = ['id','interval']
UPDATE 2 (id fix): updated data2 with more data:
data2 = np.array([[2,30],[2,900],[2,30],[2,29],[2,31],[2,30],[2,29],[2,31],
                  [2,1560],[2,30],[2,300],[2,30],[2,450],
                  [3,40],[3,900],[3,40],[3,39],[3,41],[3,40],[3,39],[3,41],
                  [3,1560],[3,40],[3,300],[3,40],[3,450]])
df2 = pd.DataFrame(data2)
df2.columns = ['id','interval']
for i, g in df.groupby([(df.interval != df.interval.shift()).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
results in [30, 30, 30].
However, I really want to catch near-number conditions, say when a number is within +-10% of the previous number;
so looking at df2 I would like to pick up the series [30, 29, 31].
for i, g in df2.groupby([(df2.interval != <???+- 10% magic ???>).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
UPDATE: Here is the end-of-line processing code, where I store the gathered lists into a dictionary with the id as the key:
leak_intervals = {}
final_leak_intervals = {}
serials = []
for i, g in df.groupby([(df.interval != df.interval.shift()).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
        serial = g.id.values[0]
        if serial not in serials:
            serials.append(serial)
        if serial not in leak_intervals:
            leak_intervals[serial] = g.interval.tolist()
        else:
            leak_intervals[serial] = leak_intervals[serial] + (g.interval.tolist())
UPDATE:
In [116]: df2.groupby(df2.interval.pct_change().abs().gt(0.1).cumsum()) \
              .filter(lambda x: len(x) >= 3)
Out[116]:
    id  interval
2    2        30
3    2        29
4    2        31
5    2        30
6    2        29
7    2        31
15   3        40
16   3        39
17   3        41
18   3        40
19   3        39
20   3        41
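One hedged caveat about this pattern: the key is computed over the whole frame, so in principle a run of near numbers could straddle two ids. Grouping on the id together with the change key rules that out:

# pair the id with the change key so a run can never cross id boundaries
key = df2.interval.pct_change().abs().gt(0.1).cumsum()
runs = df2.groupby(['id', key]).filter(lambda x: len(x) >= 3)
print(runs)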

How to preview a part of a large pandas DataFrame, in iPython notebook?

I am just getting started with pandas in the IPython Notebook and encountering the following problem: when a DataFrame read from a CSV file is small, the IPython Notebook displays it in a nice table view. When the DataFrame is large, something like this is output:
In [27]:
evaluation = readCSV("evaluation_MO_without_VNS_quality.csv").filter(["solver", "instance", "runtime", "objective"])
In [37]:
evaluation
Out[37]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 333 entries, 0 to 332
Data columns:
solver 333 non-null values
instance 333 non-null values
runtime 333 non-null values
objective 333 non-null values
dtypes: int64(1), object(3)
I would like to see a small portion of the data frame as a table just to make sure it is in the right format. What options do I have?
df.head(5) # will print out the first 5 rows
df.tail(5) # will print out the 5 last rows
In this case, where the DataFrame is long but not too wide, you can simply slice it:
>>> df = pd.DataFrame({"A": range(1000), "B": range(1000)})
>>> df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns:
A 1000 non-null values
B 1000 non-null values
dtypes: int64(2)
>>> df[:5]
A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
Note: .ix is deprecated (and was removed in pandas 1.0).
If it's both wide and long, I tend to use .ix:
>>> df = pd.DataFrame({i: range(1000) for i in range(100)})
>>> df.ix[:5, :10]
0 1 2 3 4 5 6 7 8 9 10
0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5 5
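Since .ix is gone in modern pandas, the positional replacement is .iloc. Note that .iloc's end bounds are exclusive, whereas .ix on a default integer index was label-based and inclusive, so the equivalent of the slice above is:

df.iloc[:6, :11]  # rows 0-5 and columns 0-10, the same view as df.ix[:5, :10]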
I wrote a method to show the four corners of the data and monkey-patched it onto the DataFrame class:
def _sw(df, up_rows=10, down_rows=5, left_cols=4, right_cols=3, return_df=False):
    '''display df data at four corners
    A,B (up_pt)
    C,D (down_pt)
    parameters : up_rows=10, down_rows=5, left_cols=4, right_cols=3
    usage:
        df = pd.DataFrame(np.random.randn(20,10), columns=list('ABCDEFGHIJKLMN')[0:10])
        df.sw(5,2,3,2)
        df1 = df.set_index(['A','B'], drop=True, inplace=False)
        df1.sw(5,2,3,2)
    '''
    # note: this answer predates the removal of .ix; .loc/.iloc are the modern equivalents
    #pd.set_printoptions(max_columns = 80, max_rows = 40)
    ncol, nrow = len(df.columns), len(df)
    # handle columns
    if ncol <= (left_cols + right_cols):
        up_pt = df.ix[0:up_rows, :]  # screen width can contain all columns
        down_pt = df.ix[-down_rows:, :]
    else:  # screen width can not contain all columns
        pt_a = df.ix[0:up_rows, 0:left_cols]
        pt_b = df.ix[0:up_rows, -right_cols:]
        pt_c = df[-down_rows:].ix[:, 0:left_cols]
        pt_d = df[-down_rows:].ix[:, -right_cols:]
        up_pt = pt_a.join(pt_b, how='inner')
        down_pt = pt_c.join(pt_d, how='inner')
        up_pt.insert(left_cols, '..', '..')
        down_pt.insert(left_cols, '..', '..')
    overlap_qty = len(up_pt) + len(down_pt) - len(df)
    down_pt = down_pt.drop(down_pt.index[range(overlap_qty)])  # remove overlap rows
    dt_str_list = down_pt.to_string().split('\n')  # transfer down_pt to string list
    # display up part data
    print(up_pt)
    start_row = (1 if df.index.names[0] is None else 2)  # start from 1 if without index
    # display an omission line if screen height is not enough to show all rows
    if overlap_qty < 0:
        print("." * len(dt_str_list[start_row]))
    # display down part data row by row
    for line in dt_str_list[start_row:]:
        print(line)
    # display foot note
    print("\n")
    print("Index :", df.index.names)
    print("Column:", ",".join(list(df.columns.values)))
    print("row: %d col: %d" % (len(df), len(df.columns)))
    print("\n")
    return (df if return_df else None)

pd.DataFrame.sw = _sw  # add a method to the DataFrame class
Here is the sample:
>>> df = pd.DataFrame(np.random.randn(20,10), columns=list('ABCDEFGHIJKLMN')[0:10])
>>> df.sw()
A B C D .. H I J
0 -0.8166 0.0102 0.0215 -0.0307 .. -0.0820 1.2727 0.6395
1 1.0659 -1.0102 -1.3960 0.4700 .. 1.0999 1.1222 -1.2476
2 0.4347 1.5423 0.5710 -0.5439 .. 0.2491 -0.0725 2.0645
3 -1.5952 -1.4959 2.2697 -1.1004 .. -1.9614 0.6488 -0.6190
4 -1.4426 -0.8622 0.0942 -0.1977 .. -0.7802 -1.1774 1.9682
5 1.2526 -0.2694 0.4841 -0.7568 .. 0.2481 0.3608 -0.7342
6 0.2108 2.5181 1.3631 0.4375 .. -0.1266 1.0572 0.3654
7 -1.0617 -0.4743 -1.7399 -1.4123 .. -1.0398 -1.4703 -0.9466
8 -0.5682 -1.3323 -0.6992 1.7737 .. 0.6152 0.9269 2.1854
9 0.2361 0.4873 -1.1278 -0.2251 .. 1.4232 2.1212 2.9180
10 2.0034 0.5454 -2.6337 0.1556 .. 0.0016 -1.6128 -0.8093
..............................................................
15 1.4091 0.3540 -1.3498 -1.0490 .. 0.9328 0.3668 1.3948
16 0.4528 -0.3183 0.4308 -0.1818 .. 0.1295 1.2268 0.1365
17 -0.7093 1.3991 0.9501 2.1227 .. -1.5296 1.1908 0.0318
18 1.7101 0.5962 0.8948 1.5606 .. -0.6862 0.9558 -0.5514
19 1.0329 -1.2308 -0.6896 -0.5112 .. 0.2719 1.1478 -0.1459
Index : [None]
Column: A,B,C,D,E,F,G,H,I,J
row: 20 col: 10
>>> df.sw(4,2,3,4)
A B C .. G H I J
0 -0.8166 0.0102 0.0215 .. 0.3671 -0.0820 1.2727 0.6395
1 1.0659 -1.0102 -1.3960 .. 1.0984 1.0999 1.1222 -1.2476
2 0.4347 1.5423 0.5710 .. 1.6675 0.2491 -0.0725 2.0645
3 -1.5952 -1.4959 2.2697 .. 0.4856 -1.9614 0.6488 -0.6190
4 -1.4426 -0.8622 0.0942 .. -0.0947 -0.7802 -1.1774 1.9682
..............................................................
18 1.7101 0.5962 0.8948 .. -0.8592 -0.6862 0.9558 -0.5514
19 1.0329 -1.2308 -0.6896 .. -0.3954 0.2719 1.1478 -0.1459
Index : [None]
Column: A,B,C,D,E,F,G,H,I,J
row: 20 col: 10
This line will allow you to see all rows (up to the number that you set as 'max_rows') without any rows being hidden by the dots ('.....') that normally appear between head and tail in the print output.
pd.options.display.max_rows = 500
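If you only want the expanded view once, a scoped alternative is pandas' option context manager, which restores the old limit when the block ends:

import pandas as pd

with pd.option_context('display.max_rows', 500):
    print(df)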
Here's a quick way to preview a large table without having it run too wide:
Display function:
import IPython.display

# display large dataframes in an html iframe
def ldf_display(df, lines=500):
    txt = ("<iframe " +
           "srcdoc='" + df.head(lines).to_html() + "' " +
           "width=1000 height=500>" +
           "</iframe>")
    return IPython.display.HTML(txt)
Now just run this in any cell:
ldf_display(large_dataframe)
This will convert the dataframe to html then display it in an iframe. The advantage is that you can control the output size and have easily accessible scroll bars.
Worked for my purposes, maybe it will help someone else.
To see the first n rows of DataFrame:
df.head(n) # (n=5 by default)
To see the last n rows:
df.tail(n)
I found the following approach to be the most effective for sampling a DataFrame:
print(df[A:B])  # 'A' is the first row in the range and 'B' is one past the last
For example, print(df[10:15]) will print rows 10 through 14 (the end of the slice is exclusive) from your data set.
You can just use nrows. For instance
pd.read_csv('data.csv',nrows=6)
will show the first 6 rows from data.csv.
Update: a version that returns a string instead, accommodating pandas 0.13+:
import itertools

def _sw2(df, up_rows=5, down_rows=3, left_cols=4, right_cols=2, return_df=False):
    """ return df data display string at four corners
    A,B (up_pt)
    C,D (down_pt)
    parameters : up_rows=10, down_rows=5, left_cols=4, right_cols=3
    usage:
        df = pd.DataFrame(np.random.randn(20,10), columns=list('ABCDEFGHIJKLMN')[0:10])
        df.sw(5,2,3,2)
        df1 = df.set_index(['A','B'], drop=True, inplace=False)
        df1.sw(5,2,3,2)
    """
    #pd.set_printoptions(max_columns = 80, max_rows = 40)
    nrow, ncol = df.shape  # ncol, nrow = len(df.columns), len(df)
    # handle columns
    if ncol <= (left_cols + right_cols):
        up_pt = df.ix[0:up_rows, :]  # screen width can contain all columns
        down_pt = df.ix[-down_rows:, :]
    else:  # screen width can not contain all columns
        pt_a = df.ix[0:up_rows, 0:left_cols]
        pt_b = df.ix[0:up_rows, -right_cols:]
        pt_c = df[-down_rows:].ix[:, 0:left_cols]
        pt_d = df[-down_rows:].ix[:, -right_cols:]
        up_pt = pt_a.join(pt_b, how='inner')
        down_pt = pt_c.join(pt_d, how='inner')
        up_pt.insert(left_cols, '..', '..')
        down_pt.insert(left_cols, '..', '..')
    overlap_qty = len(up_pt) + len(down_pt) - len(df)
    down_pt = down_pt.drop(down_pt.index[range(overlap_qty)])  # remove overlap rows
    dt_str_list = down_pt.to_string().split('\n')  # transfer down_pt to string list
    # display up part data
    ds = up_pt.__str__()
    # get rid of the trailing summary of the pandas 0.13+ display string by
    # finding the last 3 '\n'; ugly, though
    # refer to http://stackoverflow.com/questions/4664850/find-all-occurrences-of-a-substring-in-python
    Display_str = ds[0:ds[0:ds[0:ds.rfind('\n')].rfind('\n')].rfind('\n')]
    start_row = (1 if df.index.names[0] is None else 2)  # start from 1 if without index
    # display an omission line if screen height is not enough to show all rows
    if overlap_qty < 0:
        Display_str += "\n" + "." * len(dt_str_list[start_row]) + "\n"
    # display down part data row by row
    for line in dt_str_list[start_row:]:
        Display_str += "\n" + line
    # display foot note
    Display_str += "\n\n"
    Display_str += "Index : %s\n" % str(df.index.names)
    col_name_list = list(df.columns.values)
    if ncol < 10:
        col_name_str = ", ".join(col_name_list)
    else:
        col_name_str = ", ".join(col_name_list[0:7]) + ' ... ' + ", ".join(col_name_list[-2:])
    Display_str = Display_str + "Column: " + col_name_str + "\n"
    Display_str = Display_str + "row: %d col: %d" % (nrow, ncol) + "  "
    dty_dict = {}  # simulate a defaultdict
    # http://stackoverflow.com/questions/13565248/grouping-the-same-recurring-items-that-occur-in-a-row-from-list/13565414#13565414
    for k, g in itertools.groupby(list(df.dtypes.values)):
        try:
            dty_dict[k] = dty_dict[k] + len(list(g))
        except KeyError:
            dty_dict[k] = len(list(g))
    for key in dty_dict:
        Display_str += "{0}: {1}  ".format(key, dty_dict[key])
    Display_str += "\n\n"
    return (df if return_df else Display_str)
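Attaching and calling it works the same way as the first version; a minimal usage sketch (the monkey-patch line is mine, mirroring the DataFrame.sw = _sw assignment above):

import numpy as np
import pandas as pd

pd.DataFrame.sw2 = _sw2  # attach, as with DataFrame.sw = _sw above
df = pd.DataFrame(np.random.randn(20, 10), columns=list('ABCDEFGHIJ'))
print(df.sw2())  # _sw2 returns the display string instead of printing it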
To view only the first few entries, you can use the pandas head function, which is used as
dataframe.head(any_number)  # default is 5
dataframe.head(n=value)
You can also use slicing for this purpose, which gives the same result:
dataframe[:n]
To view the last few entries, you can use pandas tail() in a similar way:
dataframe.tail(any_number)  # default is 5
dataframe.tail(n=value)
In Python, pandas provides head() and tail() to print head and tail data respectively.
import pandas as pd
train = pd.read_csv('file_name')
train.head()   # prints the first 5 rows, as the default value is 5
train.head(n)  # prints the first n rows
train.tail()   # prints the last 5 rows, as the default value is 5
train.tail(n)  # prints the last n rows
