How to use a while loop properly? - python

I have a dataframe:
SIN1 SIN2 SIN3
4778 5633 4343
I have tried,
count_col = len(df1.columns)
check = 1
while check <= count_col:
    check_str = str(check)
    check_for_column = "SIN" + check_str in df1
    name = "SIN" + check_str
    if check_for_column == True:
        df1[name] = df1[name].astype(str)
        df1['SIN1'] = df1['SIN1'] + ',' + df1[name]
    if check == count_col:
        break
    check += 1
df1[['SIN1']]
This shows 4778,4343,4778,4343.................
When I tried,
check = 1
while check <= count_col:
    check_str = str(check)
    check_for_column = "SIN" + check_str in df1
    name = "SIN" + check_str
    if check == count_col:
        break
    check += 1
if check_for_column == True:
    df1[name] = df1[name].astype(str)
    df1['SIN1'] = df1['SIN1'] + ',' + df1[name]
df1[['SIN1']]
This shows 4778,4343
I want the result to be 4778,5633,4343.
Please don't suggest a way to directly concatenate with ','.
I used a while loop because there can be any number of SIN columns.
How do I properly use a while loop in this case?

Use apply to join column values:
>>> df['SIN'] = df.astype(str).apply(lambda x: ','.join(x), axis=1)
>>> df
SIN1 SIN2 SIN3 SIN
0 4778 5633 4343 4778,5633,4343
To select a subset of columns like SINxx, use filter:
df.filter(like='SIN')  # or df.filter(regex=r'SIN\d+')
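Putting the two together, a minimal sketch that should handle any number of SIN columns (assuming they all follow the SINx naming used in the question):
import pandas as pd

df = pd.DataFrame({'SIN1': [4778], 'SIN2': [5633], 'SIN3': [4343]})

# select only the SINx columns, cast to str, and join row-wise
df['SIN'] = df.filter(regex=r'SIN\d+').astype(str).apply(','.join, axis=1)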

This code takes all columns with SIN in them, concatenates the values as a string, and assigns it to a new column SIN.
sin_cols = [('SIN' in col) for col in df.columns]
sdf = df.loc[:, sin_cols]
df['SIN'] = sdf.apply(lambda x: ', '.join(x.values.astype(str)), axis=1)
df before
   id  T  SIN1  SIN2  Q  SIN3
0   0  8     3     6  9     8
1   1  1     6     1  7     1
2   2  5     2     4  8     6
df after
   id  T  SIN1  SIN2  Q  SIN3      SIN
0   0  8     3     6  9     8  3, 6, 8
1   1  1     6     1  7     1  6, 1, 1
2   2  5     2     4  8     6  2, 4, 6
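One caveat: the ('SIN' in col) test above also matches any column whose name merely contains the substring (a hypothetical BASIN column, say). If that matters, a stricter check is:
# match only columns whose names actually start with 'SIN'
sin_cols = [col.startswith('SIN') for col in df.columns]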

Related

Pairs of rows with the highest string similarity

So I have this dataframe:
import pandas as pd

d = {'id': [1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4],
     'name': ['ada', 'aad', 'ada', 'ada', 'dddd', 'fdd', 'ccc', 'cccd', 'ood', 'aaa', 'aaa', 'aar', 'rrp'],
     'amount': [2, -12, 12, -12, 5, -5, 2, 3, -5, 3, -10, 10, -10]}
df1 = pd.DataFrame(d)
df1
id name amount
0 1 ada 2
1 1 aad -12
2 1 ada 12
3 1 ada -12
4 2 dddd 5
5 2 fdd -5
6 3 ccc 2
7 3 cccd 3
8 3 ood -5
9 4 aaa 3
10 4 aaa -10
11 4 aar 10
12 4 rrp -10
First I want to find the matching positive for negative amounts per id, which I do through this:
def match_pos_neg(df):
    return df[df["amount"].isin(-df["amount"])]

df1 = df1.groupby("id").apply(match_pos_neg).reset_index(0, drop=True)
df1
id name amount
1 1 aad -12
2 1 ada 12
3 1 ada -12
4 2 dddd 5
5 2 fdd -5
10 4 aaa -10
11 4 aar 10
12 4 rrp -10
The next thing I want to do is get only the pairs of matching positive and negative numbers that also have the highest similarity in the string column 'name'. So if an id has two negative numbers that match the positive one, I want to isolate the pair with the highest similarity per id. My desired output would be like this:
id name amount
2 1 ada 12
3 1 ada -12
4 2 dddd 5
5 2 fdd -5
10 4 aaa -10
11 4 aar 10
I guess I have to use some type of string similarity index like SequenceMatcher or Jaccard etc., but I am not sure how to work around this. Any help on how to get my desired output would be very much appreciated.
You can try something like this. Note that you can change the information you print as you wish; you just need to edit the return values of the function create_sim.
import pandas as pd
from operator import itemgetter

d = {'id': [1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4],
     'name': ['ada', 'aad', 'ada', 'ada', 'dddd', 'fdd', 'ccc', 'cccd', 'ood', 'aaa', 'aaa', 'aar', 'rrp'],
     'amount': [2, -12, 12, -12, 5, -5, 2, 3, -5, 3, -10, 10, -10]}
df1 = pd.DataFrame(d)

def match_pos_neg(df):
    return df[df["amount"].isin(-df["amount"])]

df1 = df1.groupby("id").apply(match_pos_neg).reset_index(0, drop=True)
print(df1)

def split(word):
    return [char for char in word]

def DistJaccard(str1, str2):
    l1 = set(split(str1))
    l2 = set(split(str2))
    return float(len(l1 & l2)) / len(l1 | l2)

def create_sim(df, idx):
    idx_id = df['id'].values[idx]
    idx_amount = df['amount'].values[idx]
    idx_name = df['name'].values[idx]
    df_t = df.loc[df['id'] == idx_id]
    pos = [i for i in list(df_t['amount']) if i > 0] or None
    neg = [i for i in list(df_t['amount']) if i < 0] or None
    if pos and neg:
        l = [x for x in list(df_t['amount']) if x == idx_amount * -1]
        if len(l) > 0:
            df_t = df.loc[df['amount'] == idx_amount * -1]
            compare_list = list(df_t['name'])
            list_results = []
            for item in compare_list:
                sim = DistJaccard(idx_name, item)
                list_results.append((item, sim))
            return max(list_results, key=itemgetter(1))
    return None

count = 0
for index, row in df1.iterrows():
    res = create_sim(df1, count)
    if res:
        print(f"The most similar word of {row['name']} is {res[0]} with similarity of {res[1]}")
    else:
        print(f"No similar words of {row['name']}")
    count += 1
Edit:
In order to make a DF with the results you can change it to this:
count = 0
item1_id = []
item1_row = []
item1_name = []
item2_id = []
item2_row = []
item2_name = []
for index, row in df1.iterrows():
    res = create_sim(df1, count)
    item1_id.append(row['id'])
    item1_row.append(count)
    item1_name.append(row['name'])
    if res:
        # note: this lookup assumes create_sim was edited (as described above)
        # to also return the matched id as res[2]
        row_idx = df1.loc[(df1['id'] == res[2]) & (df1['name'] == res[0]) & (df1['amount'] != row['amount']), "name"].index.tolist()
        item2_id.append(row['id'])
        item2_row.append(row_idx[0])
        item2_name.append(res[0])
    else:
        item2_id.append(None)
        item2_row.append(None)
        item2_name.append(None)
    count += 1

final = pd.DataFrame(item1_id, columns=['item 1 id'])
final['item 1 row'] = item1_row
final['item 1 name'] = item1_name
final['item 2 id'] = item2_id
final['item 2 row'] = item2_row
final['item 2 name'] = item2_name
print(final)
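As a side note, since the question mentions SequenceMatcher: DistJaccard above only compares the sets of characters in each string. A drop-in alternative from the standard library could look like this (a sketch; the two metrics score strings differently, so results may differ):
from difflib import SequenceMatcher

def dist_sequence(str1, str2):
    # ratio() also returns a similarity in [0, 1], but it takes character
    # order into account rather than just the sets of characters used
    return SequenceMatcher(None, str1, str2).ratio()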

Find longest run of consecutive zeros for each user in dataframe

I'm looking to find the max run of consecutive zeros in a DataFrame with the result grouped by user. I'm interested in running the RLE on usage.
sample input:
user  day  usage
A     1    0
A     2    0
A     3    1
B     1    0
B     2    1
B     3    0
Desired output:
user  longest_run
A     2
B     1
mydata <- mydata[order(mydata$user, mydata$day),]
user <- unique(mydata$user)
d2 <- data.frame(matrix(NA, ncol = 2, nrow = length(user)))
names(d2) <- c("user", "longest_no_usage")
d2$user <- user
for (i in user) {
  if (0 %in% mydata$usage[mydata$user == i]) {
    run <- rle(mydata$usage[mydata$user == i])  # Run Length Encoding
    d2$longest_no_usage[d2$user == i] <- max(run$length[run$values == 0])
  } else {
    d2$longest_no_usage[d2$user == i] <- 0  # some users did not have no-usage days
  }
}
d2 <- d2[order(-d2$longest_no_usage),]
This works in R, but I want to do the same thing in Python; I'm totally stumped.
Use groupby with size on the columns user and usage plus a helper Series for consecutive values first:
print (df)
user day usage
0 A 1 0
1 A 2 0
2 A 3 1
3 B 1 0
4 B 2 1
5 B 3 0
6 C 1 1
df1 = (df.groupby([df['user'],
                   df['usage'].rename('val'),
                   df['usage'].ne(df['usage'].shift()).cumsum()])
         .size()
         .to_frame(name='longest_run'))
print (df1)
                longest_run
user val usage
A    0   1                2
     1   2                1
B    0   3                1
         5                1
     1   4                1
C    1   6                1
Then filter only the zero rows, take the max per user, and reindex to append the groups that have no zeros:
df2 = (df1.query('val == 0')
          .max(level=0)
          .reindex(df['user'].unique(), fill_value=0)
          .reset_index())
print (df2)
user longest_run
0 A 2
1 B 1
2 C 0
Detail:
print (df['usage'].ne(df['usage'].shift()).cumsum())
0 1
1 1
2 2
3 3
4 4
5 5
6 6
Name: usage, dtype: int32
Get the max number of consecutive zeros in a Series:
def max0(sr):
    # zeros inherit the cumsum label of the previous non-zero value, so the biggest
    # label count is the longest zero run plus one, unless that block is the leading
    # zeros themselves (label 0)
    return (sr != 0).cumsum().value_counts().max() - (0 if (sr != 0).cumsum().value_counts().idxmax() == 0 else 1)

max0(pd.Series([1, 0, 0, 0, 0, 2, 3]))
4
I think the following does what you are looking for, where the consecutive_zero function is an adaptation of the top answer here.
Hope this helps!
import pandas as pd
from itertools import groupby

df = pd.DataFrame([['A', 1], ['A', 0], ['A', 0], ['B', 0], ['B', 1], ['C', 2]],
                  columns=["user", "usage"])

def len_iter(items):
    return sum(1 for _ in items)

def consecutive_zero(data):
    x = list(len_iter(run) for val, run in groupby(data) if val == 0)
    if len(x) == 0:
        return 0
    else:
        return max(x)

df.groupby('user').apply(lambda x: consecutive_zero(x['usage']))
Output:
user
A 2
B 1
C 0
dtype: int64
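If you prefer to stay inside pandas without itertools, the shift/cumsum trick from the first answer can be wrapped into a helper; a minimal sketch on the same df:
def max_consecutive_zero(s):
    # label each run of identical values, then take the longest run of zeros
    runs = s.ne(s.shift()).cumsum()
    return int(s.eq(0).groupby(runs).sum().max())

df.groupby('user')['usage'].apply(max_consecutive_zero)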
If you have a large dataset and speed is essential, you might want to try the high-performance pyrle library.
Setup:
# pip install pyrle
# or
# conda install -c bioconda pyrle
import numpy as np
np.random.seed(0)
import pandas as pd
from pyrle import Rle
size = int(1e7)
number = np.random.randint(2, size=size)
user = np.random.randint(5, size=size)
df = pd.DataFrame({"User": np.sort(user), "Number": number})
df
# User Number
# 0 0 0
# 1 0 1
# 2 0 1
# 3 0 0
# 4 0 1
# ... ... ...
# 9999995 4 1
# 9999996 4 1
# 9999997 4 0
# 9999998 4 0
# 9999999 4 1
#
# [10000000 rows x 2 columns]
Execution:
for u, udf in df.groupby("User"):
    r = Rle(udf.Number)
    is_0 = r.values == 0
    print("User", u, "Max", np.max(r.runs[is_0]))
# (Wall time: 1.41 s)
# User 0 Max 20
# User 1 Max 23
# User 2 Max 20
# User 3 Max 22
# User 4 Max 23

Simple method to change DataFrame column type but use a default value for errors?

Suppose I have the following column.
>>> import pandas
>>> a = pandas.Series(['0', '1', '5', '1', None, '3', 'Cat', '2'])
I would like to be able to convert all the data in the column to type int, and any element that cannot be converted should be replaced with a 0.
My current solution to this is to use to_numeric with the 'coerce' option, fill any NaN with 0, and then convert to int (since the presence of NaN made the column float instead of int).
>>> pandas.to_numeric(a, errors='coerce').fillna(0).astype(int)
0 0
1 1
2 5
3 1
4 0
5 3
6 0
7 2
dtype: int64
Is there any method that would allow me to do this in one step rather than having to go through two intermediate states? I am looking for something that would behave like the following imaginary option to astype:
>>> a.astype(int, value_on_error=0)
Option 1
pd.to_numeric(a, 'coerce').fillna(0).astype(int)
Option 2
b = pd.to_numeric(a, 'coerce')
b.mask(b.isnull(), 0).astype(int)
Option 3
def try_int(x):
    try:
        return int(x)
    except:
        return 0

a.apply(try_int)
Option 4
b = np.empty(a.shape, dtype=int)
i = np.core.defchararray.isdigit(a.values.astype(str))
b[i] = a[i].astype(int)
b[~i] = 0
pd.Series(b, a.index)
All produce
0 0
1 1
2 5
3 1
4 0
5 3
6 0
7 2
dtype: int64
Timing
Code Below
import numpy as np
import pandas as pd
from timeit import timeit

def pir1(a):
    return pd.to_numeric(a, 'coerce').fillna(0).astype(int)

def pir2(a):
    b = pd.to_numeric(a, 'coerce')
    return b.mask(b.isnull(), 0).astype(int)

def try_int(x):
    try:
        return int(x)
    except:
        return 0

def pir3(a):
    return a.apply(try_int)

def pir4(a):
    b = np.empty(a.shape, dtype=int)
    i = np.core.defchararray.isdigit(a.values.astype(str))
    b[i] = a[i].astype(int)
    b[~i] = 0
    return pd.Series(b, a.index)

def alt1(a):
    return pd.to_numeric(a.where(a.str.isnumeric(), 0))

results = pd.DataFrame(
    index=[1, 3, 10, 30, 100, 300, 1000, 3000, 10000],
    columns='pir1 pir2 pir3 pir4 alt1'.split()
)

for i in results.index:
    c = pd.concat([a] * i, ignore_index=True)
    for j in results.columns:
        stmt = '{}(c)'.format(j)
        setp = 'from __main__ import c, {}'.format(j)
        # set_value was removed in later pandas; results.at[i, j] = ... is the modern equivalent
        results.set_value(i, j, timeit(stmt, setp, number=10))

results.plot(logx=True, logy=True)
a.where(a.str.isnumeric(), 0).astype(int)
Output:
0 0
1 1
2 5
3 1
4 0
5 3
6 0
7 2
dtype: int64
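As a side note, if you are on a pandas version with nullable integer dtypes (0.24+), you can keep the failed parses as missing values instead of forcing a 0; a sketch:
import pandas as pd

a = pd.Series(['0', '1', '5', '1', None, '3', 'Cat', '2'])

# nullable Int64 keeps unparseable entries as <NA> instead of a fill value
b = pd.to_numeric(a, errors='coerce').astype('Int64')

# filling with 0 afterwards reproduces the behaviour asked for
c = b.fillna(0)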

pandas convert text feature to numeric value

I can convert all text features in a pandas dataframe by casting to 'category' using the df.astype() method as below. However, I find category hard to work with (e.g. for plotting data) and would prefer to create a new column of integers.
#convert all objects to categories
object_types = dataset.select_dtypes(include=['O'])
for col in object_types:
    dataset['{0}_category'.format(col)] = dataset[col].astype('category')
I can convert the text to integers using this hack:
#convert all objects to int values
object_types = dataset.select_dtypes(include=['O'])
new_cols = {}
for col in object_types:
    data_set = set(dataset[col].tolist())
    data_indexed = {}
    for i, item in enumerate(data_set):
        data_indexed[item] = i
    new_list = []
    for item in dataset[col].tolist():
        new_list.append(data_indexed[item])
    new_cols[col] = new_list
for key, val in new_cols.items():
    dataset['{0}_int_value'.format(key)] = val
But is there a better (or existing) way to do the same?
I would use the factorize method, which is designed for this particular task:
In [90]: x
Out[90]:
A B
9 c z
10 c z
4 b x
5 b y
1 a w
7 b z
In [91]: x.apply(lambda col: pd.factorize(col, sort=True)[0])
Out[91]:
A B
9 2 3
10 2 3
4 1 1
5 1 2
1 0 0
7 1 3
or:
In [92]: x.apply(lambda col: pd.factorize(col)[0])
Out[92]:
A B
9 0 0
10 0 0
4 1 1
5 1 2
1 2 3
7 1 0
Consider df:
df = pd.DataFrame(dict(A=list('aaaabbbbcccc'),
                       B=list('wwxxxyyzzzzz')))
df
You can convert to integers like this:
def intify(s):
    u = np.unique(s)
    i = np.arange(len(u))
    return s.map(dict(zip(u, i)))
or a shorter version:
def intify(s):
    u = np.unique(s)
    return s.map({k: i for i, k in enumerate(u)})
df.apply(intify)
Or in a single line:
df.apply(lambda s: s.map({k: i for i, k in enumerate(s.unique())}))
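Another built-in route worth knowing, since the question already casts to category: the integer code behind each category is directly available via .cat.codes, which for the default (sorted) categories matches factorize(col, sort=True). A short sketch:
import pandas as pd

df = pd.DataFrame(dict(A=list('aaaabbbbcccc'),
                       B=list('wwxxxyyzzzzz')))

# .cat.codes exposes the integer code assigned to each category
df_int = df.apply(lambda s: s.astype('category').cat.codes)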

How to preview a part of a large pandas DataFrame, in iPython notebook?

I am just getting started with pandas in the IPython Notebook and encountering the following problem: when a DataFrame read from a CSV file is small, the IPython Notebook displays it in a nice table view. When the DataFrame is large, something like this is output:
In [27]:
evaluation = readCSV("evaluation_MO_without_VNS_quality.csv").filter(["solver", "instance", "runtime", "objective"])
In [37]:
evaluation
Out[37]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 333 entries, 0 to 332
Data columns:
solver 333 non-null values
instance 333 non-null values
runtime 333 non-null values
objective 333 non-null values
dtypes: int64(1), object(3)
I would like to see a small portion of the data frame as a table just to make sure it is in the right format. What options do I have?
df.head(5) # will print out the first 5 rows
df.tail(5) # will print out the 5 last rows
In this case, where the DataFrame is long but not too wide, you can simply slice it:
>>> df = pd.DataFrame({"A": range(1000), "B": range(1000)})
>>> df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns:
A 1000 non-null values
B 1000 non-null values
dtypes: int64(2)
>>> df[:5]
A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
If it's both wide and long, I tend to use .ix (note: .ix is now deprecated):
>>> df = pd.DataFrame({i: range(1000) for i in range(100)})
>>> df.ix[:5, :10]
0 1 2 3 4 5 6 7 8 9 10
0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5 5
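Since .ix is deprecated, the positional equivalent today is .iloc. Note that .ix sliced integer labels inclusively (rows 0 through 5 is six rows), while .iloc excludes the endpoint:
# the same block as df.ix[:5, :10], expressed positionally
df.iloc[:6, :11]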
I wrote a method to show the four corners of the data and monkey-patched DataFrame to do so:
def _sw(df, up_rows=10, down_rows=5, left_cols=4, right_cols=3, return_df=False):
    '''display df data at four corners
        A,B (up_pt)
        C,D (down_pt)
    parameters : up_rows=10, down_rows=5, left_cols=4, right_cols=3
    usage:
        df = pd.DataFrame(np.random.randn(20,10), columns=list('ABCDEFGHIJKLMN')[0:10])
        df.sw(5,2,3,2)
        df1 = df.set_index(['A','B'], drop=True, inplace=False)
        df1.sw(5,2,3,2)
    '''
    #pd.set_printoptions(max_columns = 80, max_rows = 40)
    ncol, nrow = len(df.columns), len(df)
    # handle columns
    if ncol <= (left_cols + right_cols):
        up_pt = df.ix[0:up_rows, :]      # screen width can contain all columns
        down_pt = df.ix[-down_rows:, :]
    else:                                # screen width can not contain all columns
        pt_a = df.ix[0:up_rows, 0:left_cols]
        pt_b = df.ix[0:up_rows, -right_cols:]
        pt_c = df[-down_rows:].ix[:, 0:left_cols]
        pt_d = df[-down_rows:].ix[:, -right_cols:]
        up_pt = pt_a.join(pt_b, how='inner')
        down_pt = pt_c.join(pt_d, how='inner')
        up_pt.insert(left_cols, '..', '..')
        down_pt.insert(left_cols, '..', '..')
    overlap_qty = len(up_pt) + len(down_pt) - len(df)
    down_pt = down_pt.drop(down_pt.index[range(overlap_qty)])  # remove overlap rows
    dt_str_list = down_pt.to_string().split('\n')  # transfer down_pt to string list
    # Display up part data
    print up_pt
    start_row = (1 if df.index.names[0] is None else 2)  # start from 1 if without index
    # Display omit line if screen height is not enough to display all rows
    if overlap_qty < 0:
        print "." * len(dt_str_list[start_row])
    # Display down part data row by row
    for line in dt_str_list[start_row:]:
        print line
    # Display foot note
    print "\n"
    print "Index :", df.index.names
    print "Column:", ",".join(list(df.columns.values))
    print "row: %d col: %d" % (len(df), len(df.columns))
    print "\n"
    return (df if return_df else None)

DataFrame.sw = _sw  # add a method to DataFrame class
Here is the sample:
>>> df = pd.DataFrame(np.random.randn(20,10), columns=list('ABCDEFGHIJKLMN')[0:10])
>>> df.sw()
A B C D .. H I J
0 -0.8166 0.0102 0.0215 -0.0307 .. -0.0820 1.2727 0.6395
1 1.0659 -1.0102 -1.3960 0.4700 .. 1.0999 1.1222 -1.2476
2 0.4347 1.5423 0.5710 -0.5439 .. 0.2491 -0.0725 2.0645
3 -1.5952 -1.4959 2.2697 -1.1004 .. -1.9614 0.6488 -0.6190
4 -1.4426 -0.8622 0.0942 -0.1977 .. -0.7802 -1.1774 1.9682
5 1.2526 -0.2694 0.4841 -0.7568 .. 0.2481 0.3608 -0.7342
6 0.2108 2.5181 1.3631 0.4375 .. -0.1266 1.0572 0.3654
7 -1.0617 -0.4743 -1.7399 -1.4123 .. -1.0398 -1.4703 -0.9466
8 -0.5682 -1.3323 -0.6992 1.7737 .. 0.6152 0.9269 2.1854
9 0.2361 0.4873 -1.1278 -0.2251 .. 1.4232 2.1212 2.9180
10 2.0034 0.5454 -2.6337 0.1556 .. 0.0016 -1.6128 -0.8093
..............................................................
15 1.4091 0.3540 -1.3498 -1.0490 .. 0.9328 0.3668 1.3948
16 0.4528 -0.3183 0.4308 -0.1818 .. 0.1295 1.2268 0.1365
17 -0.7093 1.3991 0.9501 2.1227 .. -1.5296 1.1908 0.0318
18 1.7101 0.5962 0.8948 1.5606 .. -0.6862 0.9558 -0.5514
19 1.0329 -1.2308 -0.6896 -0.5112 .. 0.2719 1.1478 -0.1459
Index : [None]
Column: A,B,C,D,E,F,G,H,I,J
row: 20 col: 10
>>> df.sw(4,2,3,4)
A B C .. G H I J
0 -0.8166 0.0102 0.0215 .. 0.3671 -0.0820 1.2727 0.6395
1 1.0659 -1.0102 -1.3960 .. 1.0984 1.0999 1.1222 -1.2476
2 0.4347 1.5423 0.5710 .. 1.6675 0.2491 -0.0725 2.0645
3 -1.5952 -1.4959 2.2697 .. 0.4856 -1.9614 0.6488 -0.6190
4 -1.4426 -0.8622 0.0942 .. -0.0947 -0.7802 -1.1774 1.9682
..............................................................
18 1.7101 0.5962 0.8948 .. -0.8592 -0.6862 0.9558 -0.5514
19 1.0329 -1.2308 -0.6896 .. -0.3954 0.2719 1.1478 -0.1459
Index : [None]
Column: A,B,C,D,E,F,G,H,I,J
row: 20 col: 10
This line will allow you to see all rows (up to the number that you set as 'max_rows') without any rows being hidden by the dots ('.....') that normally appear between head and tail in the print output.
pd.options.display.max_rows = 500
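There is a companion display option for wide frames:
import pandas as pd

pd.options.display.max_columns = 50  # same idea as max_rows, for columns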
Here's a quick way to preview a large table without having it run too wide:
Display function:
import IPython.display

# display large dataframes in an html iframe
def ldf_display(df, lines=500):
    txt = ("<iframe " +
           "srcdoc='" + df.head(lines).to_html() + "' " +
           "width=1000 height=500>" +
           "</iframe>")
    return IPython.display.HTML(txt)
Now just run this in any cell:
ldf_display(large_dataframe)
This will convert the dataframe to html then display it in an iframe. The advantage is that you can control the output size and have easily accessible scroll bars.
Worked for my purposes, maybe it will help someone else.
To see the first n rows of DataFrame:
df.head(n) # (n=5 by default)
To see the last n rows:
df.tail(n)
I found the following approach to be the most effective for sampling a DataFrame:
print(df[A:B])  # 'A' is the first record in the range; the slice stops just before 'B'
For example, print(df[10:15]) will print rows 10 through 14 (the end of the slice is exclusive) from your data set.
You can just use nrows. For instance
pd.read_csv('data.csv',nrows=6)
will show the first 6 rows from data.csv.
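nrows also combines with skiprows if you want to peek further into the file; a sketch that keeps row 0 as the header:
import pandas as pd

# skip data rows 1-999 but keep the header row, then read 6 rows
pd.read_csv('data.csv', skiprows=range(1, 1000), nrows=6)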
An update to generate a string instead, accommodating Pandas 0.13+:
import itertools

def _sw2(df, up_rows=5, down_rows=3, left_cols=4, right_cols=2, return_df=False):
    """return df data display string at four corners
        A,B (up_pt)
        C,D (down_pt)
    parameters : up_rows=10, down_rows=5, left_cols=4, right_cols=3
    usage:
        df = pd.DataFrame(np.random.randn(20,10), columns=list('ABCDEFGHIJKLMN')[0:10])
        df.sw(5,2,3,2)
        df1 = df.set_index(['A','B'], drop=True, inplace=False)
        df1.sw(5,2,3,2)
    """
    #pd.set_printoptions(max_columns = 80, max_rows = 40)
    nrow, ncol = df.shape  # ncol, nrow = len(df.columns), len(df)
    # handle columns
    if ncol <= (left_cols + right_cols):
        up_pt = df.ix[0:up_rows, :]      # screen width can contain all columns
        down_pt = df.ix[-down_rows:, :]
    else:                                # screen width can not contain all columns
        pt_a = df.ix[0:up_rows, 0:left_cols]
        pt_b = df.ix[0:up_rows, -right_cols:]
        pt_c = df[-down_rows:].ix[:, 0:left_cols]
        pt_d = df[-down_rows:].ix[:, -right_cols:]
        up_pt = pt_a.join(pt_b, how='inner')
        down_pt = pt_c.join(pt_d, how='inner')
        up_pt.insert(left_cols, '..', '..')
        down_pt.insert(left_cols, '..', '..')
    overlap_qty = len(up_pt) + len(down_pt) - len(df)
    down_pt = down_pt.drop(down_pt.index[range(overlap_qty)])  # remove overlap rows
    dt_str_list = down_pt.to_string().split('\n')  # transfer down_pt to string list
    # Display up part data
    ds = up_pt.__str__()
    # get rid of the ending part of the Pandas 0.13+ display string by finding
    # the last 3 '\n'; ugly though, refer to
    # http://stackoverflow.com/questions/4664850/find-all-occurrences-of-a-substring-in-python
    Display_str = ds[0:ds[0:ds[0:ds.rfind('\n')].rfind('\n')].rfind('\n')]
    start_row = (1 if df.index.names[0] is None else 2)  # start from 1 if without index
    # Display omit line if screen height is not enough to display all rows
    if overlap_qty < 0:
        Display_str += "\n"
        Display_str += "." * len(dt_str_list[start_row])
        Display_str += "\n"
    # Display down part data row by row
    for line in dt_str_list[start_row:]:
        Display_str += "\n"
        Display_str += line
    # Display foot note
    Display_str += "\n\n"
    Display_str += "Index : %s\n" % str(df.index.names)
    col_name_list = list(df.columns.values)
    if ncol < 10:
        col_name_str = ", ".join(col_name_list)
    else:
        col_name_str = ", ".join(col_name_list[0:7]) + ' ... ' + ", ".join(col_name_list[-2:])
    Display_str = Display_str + "Column: " + col_name_str + "\n"
    Display_str = Display_str + "row: %d col: %d" % (nrow, ncol) + " "
    dty_dict = {}  # simulate defaultdict
    # group the same recurring dtypes that occur in a row, see
    # http://stackoverflow.com/questions/13565248/grouping-the-same-recurring-items-that-occur-in-a-row-from-list/13565414#13565414
    for k, g in itertools.groupby(list(df.dtypes.values)):
        try:
            dty_dict[k] = dty_dict[k] + len(list(g))
        except KeyError:
            dty_dict[k] = len(list(g))
    for key in dty_dict:
        Display_str += "{0}: {1} ".format(key, dty_dict[key])
    Display_str += "\n\n"
    return (df if return_df else Display_str)
In order to view only the first few entries, you can use the pandas head function:
dataframe.head(any_number)  # default is 5
dataframe.head(n=value)
or you can also use slicing for this purpose, which gives the same result:
dataframe[:n]
In order to view the last few entries, you can use pandas tail() in a similar way:
dataframe.tail(any_number)  # default is 5
dataframe.tail(n=value)
In Python, pandas provides head() and tail() to print head and tail data respectively.
import pandas as pd

train = pd.read_csv('file_name')
train.head()   # prints 5 head rows, as the default value is 5
train.head(n)  # prints n head rows
train.tail()   # prints 5 tail rows, as the default value is 5
train.tail(n)  # prints n tail rows
