Example input:
[('b', 'c', 4),
('l', 'r', 5),
('i', 'a', 6),
('c', 't', 7),
('a', '$', 8),
('n', '$', 9)]
[0] contains the vertical heading, [1] contains the horizontal heading.
Example output:
  c r a t $ $
b 4
l   5
i     6
c       7
a         8
n           9
Note: given enough tuples the entire table could be filled :P
How do I format output as a table in Python using [preferably] one line of code?
Here's an answer for your revised question:
data = [
['A','a','1'],
['B','b','2'],
['C','c','3'],
['D','d','4']
]
# Desired output:
#
#   A B C D
# a 1
# b   2
# c     3
# d       4
# Check data consists of colname, rowname, value triples
assert all([3 == len(row) for row in data])
# Convert all data to strings
data = [ [str(c) for c in r] for r in data]
# Check all data is one character wide
assert all([1 == len(s) for r in data for s in r])
#============================================================================
# Verbose version
#============================================================================
col_names, row_names, values = zip(*data) # Transpose
header_line = '  ' + ' '.join(col_names)
row_lines = []
for idx, (row_name, value) in enumerate(zip(row_names,values)):
    # Use '  '*n to get 2n consecutive spaces.
    row_line = row_name + ' ' + '  '*idx + value
    row_lines.append(row_line)
print(header_line)
for r in row_lines:
    print(r)
Or, if that's too long for you, try this:
cs, rs, vs = zip(*data)
print('\n'.join(['  '+' '.join(cs)] + [r+' '+'  '*i+v for i,(r,v) in enumerate(zip(rs,vs))]))
Both have the following output:
  A B C D
a 1
b   2
c     3
d       4
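Both versions above assume every name and value is exactly one character wide. A minimal sketch of the same layout for strings of any width follows; the helper name `format_diagonal` is made up for illustration, and it assumes the same (colname, rowname, value) ordering as `data`:

```python
def format_diagonal(triples):
    """Return a table with one value per row, stepping one column right each row."""
    cols, rows, vals = zip(*[[str(c) for c in t] for t in triples])
    # Pad every cell to the widest string so the columns line up.
    width = max(len(s) for s in cols + rows + vals)
    blank = " " * width
    # Header: an empty corner cell followed by the column names.
    lines = [" ".join([blank] + [c.ljust(width) for c in cols]).rstrip()]
    for i, (row, val) in enumerate(zip(rows, vals)):
        # Row name, then i empty cells, then the value in column i.
        lines.append(" ".join([row.ljust(width)] + [blank] * i + [val]))
    return "\n".join(lines)


data = [["A", "a", "1"], ["B", "b", "2"], ["C", "c", "3"], ["D", "d", "4"]]
table = format_diagonal(data)
print(table)
```

With one-character cells this produces the same text as the two versions above; wider cells simply widen every column.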
Here's the kernel of what you want (no header row or header column):
>>> print('\n'.join([ ''.join([str(i+j+2).rjust(3)
for i in range(10)]) for j in range(10) ]))
2 3 4 5 6 7 8 9 10 11
3 4 5 6 7 8 9 10 11 12
4 5 6 7 8 9 10 11 12 13
5 6 7 8 9 10 11 12 13 14
6 7 8 9 10 11 12 13 14 15
7 8 9 10 11 12 13 14 15 16
8 9 10 11 12 13 14 15 16 17
9 10 11 12 13 14 15 16 17 18
10 11 12 13 14 15 16 17 18 19
11 12 13 14 15 16 17 18 19 20
It uses a nested list comprehension over i and j to generate the numbers i+j+2, then str.rjust() to pad all fields to three characters in length, and finally some str.join()s to put all the substrings together.
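If the header row and header column are wanted as well, one possible extension of the same idea is sketched below. The labels 1..10 and the names `n`, `width`, `header` and `body` are assumptions chosen for illustration, not part of the original answer:

```python
n = 10
width = 3

# Header row: a blank corner cell followed by the column labels 1..n.
header = "".join([" " * width] + [str(i).rjust(width) for i in range(1, n + 1)])

# Each body row starts with its row label, then the sums label_row + label_col.
body = [
    "".join([str(j).rjust(width)] + [str(i + j).rjust(width) for i in range(1, n + 1)])
    for j in range(1, n + 1)
]

table = "\n".join([header] + body)
print(table)
```

The numeric part is the same grid as before (2 through 20); only the labels are new.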
Assuming python 2.x, it's a bit ugly, but it's functional:
import operator
from functools import partial
x = range(1,11)
y = range(0,11)
multtable = [y]+[[i]+map(partial(operator.add,i),y[1:]) for i in x]
for i in multtable:
    for j in i:
        print str(j).rjust(3),
    print
0 1 2 3 4 5 6 7 8 9 10
1 2 3 4 5 6 7 8 9 10 11
2 3 4 5 6 7 8 9 10 11 12
3 4 5 6 7 8 9 10 11 12 13
4 5 6 7 8 9 10 11 12 13 14
5 6 7 8 9 10 11 12 13 14 15
6 7 8 9 10 11 12 13 14 15 16
7 8 9 10 11 12 13 14 15 16 17
8 9 10 11 12 13 14 15 16 17 18
9 10 11 12 13 14 15 16 17 18 19
10 11 12 13 14 15 16 17 18 19 20
Your problem is so darn specific, it's difficult to make a real generic example.
The important part here, though, is the part that makes the table, rather than the actual printing:
[map(partial(operator.add,i),y[1:]) for i in x]
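For readers on Python 3, here is a hedged port of the same construction. In Python 3 `map()` returns an iterator and `range()` is not a list, so both are wrapped in `list()` before concatenation, and `print` is a function. The variable names follow the answer above:

```python
import operator
from functools import partial

x = range(1, 11)
y = list(range(0, 11))

# First row is the header 0..10; each later row is its label i followed by i + column.
multtable = [y] + [[i] + list(map(partial(operator.add, i), y[1:])) for i in x]

for row in multtable:
    print(" ".join(str(cell).rjust(3) for cell in row))
```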
I have the following dataframe with each row containing two values.
print(x)
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17
6 16 18
7 16 19
8 17 18
9 17 19
10 18 19
11 20 21
I want to merge these values if one or both values of a particular row reoccur in another row. The principle can be explained as follows: if A and B are together in one row and B and C are together in another row, then A, B and C should be together. Given the dataframe above, the outcome I want is:
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17 18 19
6 20 21
I tried creating a loop with df.duplicated that would create such an outcome, but it hasn't worked out yet.
This looks like a graph theory problem about connected components. You can use the networkx library:
import networkx as nx
import pandas as pd
g = nx.from_pandas_edgelist(df, 'a', 'b')
# Each component is a set, so sort it to put the smallest value in column 'a'
pd.concat([pd.Series([sorted(i)[0],
                      ' '.join(map(str, sorted(i)[1:]))],
                     index=['a', 'b'])
           for i in nx.connected_components(g)], axis=1).T
Output:
a b
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17 18 19
6 20 21
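If networkx is not available, the same grouping can be sketched with a small union-find (disjoint-set) structure in plain Python. This swaps in a different technique from the answer above; the function name `connected_groups` and the list-of-pairs input are assumptions for illustration:

```python
def connected_groups(pairs):
    """Group values that are linked directly or transitively by any pair."""
    parent = {}

    def find(v):
        # Register unseen values as their own root, then walk up to the root.
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps the trees shallow
            v = parent[v]
        return v

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for v in parent:
        groups.setdefault(find(v), []).append(v)
    return sorted(sorted(g) for g in groups.values())


pairs = [(0, 1), (4, 5), (8, 9), (10, 11), (14, 15), (16, 17), (16, 18),
         (16, 19), (17, 18), (17, 19), (18, 19), (20, 21)]
groups = connected_groups(pairs)
print(groups)
```

Rows of a two-column dataframe could be fed in with something like `df.itertuples(index=False)`, assuming each row holds exactly two values.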
Is there a more efficient way to remove the 0 from the beginning and insert the 20 at the end and retain the shape (1, 20)?
# What I have.
array = np.arange(20)[np.newaxis]
print(array.shape, array)
# Remove 0 from the beginning and add 20 to the end.
array = np.append(array[0, 1:], np.array([[20]]))
print(array)
array = array[np.newaxis]
print(array.shape, array)
Output:
(1, 20) [[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]]
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20]
(1, 20) [[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20]]
You can just select a subset of the current array excluding the first element and then add 20 or whatever scalar you want at the end.
x = np.append(array[:,1:],[[20]], axis=1)
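As a quick check of the slice-and-append line above, here is a small sketch that also shows one alternative, `np.roll` followed by overwriting the last element. The names `shifted` and `rolled` are illustrative only:

```python
import numpy as np

array = np.arange(20)[np.newaxis]  # shape (1, 20), values 0..19

# Drop the first column, then append 20 along axis 1 so the result stays 2-D.
shifted = np.append(array[:, 1:], [[20]], axis=1)

# Alternative: rotate left by one position (returns a new array),
# then overwrite the wrapped-around 0 in the last slot.
rolled = np.roll(array, -1, axis=1)
rolled[0, -1] = 20

print(shifted.shape, shifted)
```

Both approaches allocate a new array; neither modifies `array` itself.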
Maybe like this:
array= np.linspace(start=1, stop=20, num=20, endpoint=True, dtype=int)[np.newaxis]
print(array.shape, array)
Assume I have a data frame like:
import pandas as pd
df = pd.DataFrame({"user_id": [1, 5, 11],
"user_type": ["I", "I", "II"],
"joined_for": [1.4, 9.4, 18.1]})
Now I'd like to:
Take each user's joined_for and get the ceiling integer.
Based on the integer, create a new data frame containing number sequences where the maximum is the ceiling number.
This is how I do it now:
import math
new_df = pd.DataFrame()
for i in range(df.shape[0]):
    ceil_num = math.ceil(df.iloc[i]["joined_for"])
    new_df = new_df.append(pd.DataFrame({"user_id": df.iloc[i]["user_id"],
                                         "joined_month": range(1, ceil_num+1)}),
                           ignore_index=True)
new_df = new_df.merge(df.drop(columns="joined_for"), on="user_id")
new_df is what I want, but it's so time-consuming when there are lots of users and the number of joined_for can be larger. Is there any better way to do this? Faster or neater?
Using a comprehension
pd.DataFrame([
[t.user_id, m, t.user_type] for t in df.itertuples(index=False)
for m in range(1, math.ceil(t.joined_for) + 1)
], columns=['user_id', 'joined_month', 'user_type'])
user_id joined_month user_type
0 1 1 I
1 1 2 I
2 5 1 I
3 5 2 I
4 5 3 I
5 5 4 I
6 5 5 I
7 5 6 I
8 5 7 I
9 5 8 I
10 5 9 I
11 5 10 I
12 11 1 II
13 11 2 II
14 11 3 II
15 11 4 II
16 11 5 II
17 11 6 II
18 11 7 II
19 11 8 II
20 11 9 II
21 11 10 II
22 11 11 II
23 11 12 II
24 11 13 II
25 11 14 II
26 11 15 II
27 11 16 II
28 11 17 II
29 11 18 II
30 11 19 II
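For large frames, a vectorized variant may be worth trying. The sketch below is an alternative to the comprehension, not the answer's own method: it repeats each row once per month with `Index.repeat` and numbers the copies with `groupby(...).cumcount()`. It assumes the default unique integer index of the example `df`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"user_id": [1, 5, 11],
                   "user_type": ["I", "I", "II"],
                   "joined_for": [1.4, 9.4, 18.1]})

# Ceiling of joined_for = how many rows each user should get.
months = np.ceil(df["joined_for"]).astype(int).to_numpy()

# Repeat each row label `months` times; the index still identifies the source row.
out = df.loc[df.index.repeat(months), ["user_id", "user_type"]].copy()

# Number the copies 1..n within each original row.
out["joined_month"] = out.groupby(level=0).cumcount().to_numpy() + 1
out = out.reset_index(drop=True)[["user_id", "joined_month", "user_type"]]
print(out)
```

Whether this is actually faster depends on the data, so it is worth timing against the comprehension.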
I want to randomly select 10% of all rows in my df and replace each with a randomly sampled existing row from the df.
To randomly select 10% of rows rows_to_change = df.sample(frac=0.1) works and I can get a new random existing row with replacement_sample = df.sample(n=1) but how do I put this together to quickly iterate over the entire 10%?
The df contains millions of rows x ~100 cols.
Example df:
df = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],'B':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],'C':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})
A B C
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 14 14 14
14 15 15 15
Let's say it randomly samples indexes 2,13 to replace with randomly selected indexes 6,9 the final df would look like:
A B C
0 1 1 1
1 2 2 2
2 7 7 7
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 10 10 10
14 15 15 15
You can take a random sample, then take another random sample of the same size and replace the values at those indices with the original sample.
import pandas as pd
df = pd.DataFrame({'A': range(1,16), 'B': range(1,16), 'C': range(1,16)})
samp = df.sample(frac=0.1)
samp
# returns:
A B C
6 7 7 7
9 10 10 10
replace = df.loc[~df.index.isin(samp.index)].sample(samp.shape[0])
replace
# returns:
A B C
3 4 4 4
7 8 8 8
df.loc[replace.index] = samp.values
This copies the rows without replacement
df
# returns:
A B C
0 1 1 1
1 2 2 2
2 3 3 3
3 7 7 7
4 5 5 5
5 6 6 6
6 7 7 7
7 10 10 10
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 14 14 14
14 15 15 15
To sample with replacement, use the keyword replace = True when defining samp
@James' answer is a smart Pandas solution. However, since you noted that your dataset has millions of rows, you could also consider NumPy, because Pandas often comes with significant performance overhead.
import numpy as np
import pandas as pd

def repl_rows(df: pd.DataFrame, pct: float):
    # Modifies `df` in place.
    n, _ = df.shape
    rows = int(2 * np.ceil(n * pct))  # Total rows in both sets
    idx = np.arange(n)  # Positions 0..n-1, independent of the DataFrame's own index
    full = np.random.choice(idx, size=rows, replace=False)
    to_repl, repl_with = np.split(full, 2)
    # Assign by position; writing to `df.values[...]` only sticks when `.values` is a view
    df.iloc[to_repl] = df.iloc[repl_with].to_numpy()
Steps:
Get target rows as an integer.
Get a NumPy range-array the same length as your index. Might provide more stability than using the index itself if you have something like an uneven datetime index. (I'm not totally sure, something to toy around with.)
Sample from this index without replacement, sample size is 2 times the number of rows you want to manipulate.
Split the result in half to get targets and replacements. Should be faster than two calls to choice().
Replace at positions to_repl with values from repl_with.
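The index-drawing part of these steps can be illustrated without NumPy or Pandas. The sketch below uses only the standard library and a list of lists as a stand-in for the frame; the helper `pick_row_pairs` and its `seed` parameter are made up for illustration:

```python
import math
import random


def pick_row_pairs(n_rows, pct, seed=None):
    """Return (to_replace, replace_with): two disjoint lists of row positions."""
    rng = random.Random(seed)
    k = math.ceil(n_rows * pct)  # number of rows to overwrite
    # One draw without replacement of 2*k positions, then split it in half.
    drawn = rng.sample(range(n_rows), 2 * k)
    return drawn[:k], drawn[k:]


rows = [[i, i, i] for i in range(1, 16)]  # stand-in for the 15-row example df
to_repl, repl_with = pick_row_pairs(len(rows), 0.1, seed=0)
for target, source in zip(to_repl, repl_with):
    rows[target] = list(rows[source])  # copy the source row over the target row
print(to_repl, repl_with)
```

Because both halves come from a single draw without replacement, a row is never chosen as both a target and a source.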
I have two text files A and B, with 16 and 14 columns respectively.
The columns in these files are separated with spaces.
For each entry in column 9 of file A, I want to check if the entry is in column 8 of file B.
If it is, I would like to add this value to a new file (file C). However, I would like file C to retain the same format as file A.
In other words, each matching line in the new file should contain 17 columns: the 16 columns of file A plus the matched value.
I have been unable to figure out how to approach this problem and cannot include my progress as a result. Any help is appreciated.
Thank you in advance.
You can read both files into a list, extract B's 8th column in a list and then iterate over file A and check if its 9th element matches with the list of column 8 of B.
If there is a match, the code appends the matched value to the end of that line of A; otherwise it writes the line of A unchanged.
NOTE: if you do not need the line when there is no match then you can delete the else part.
Code
alines = [line.rstrip('\n') for line in open('aa.txt')]
blines = [line.rstrip('\n') for line in open('bb.txt')]
column8b=[]
for line in blines:
    column8b.append(line.split(" ")[7])
with open('cc.txt', "w") as oFile:
    for line in alines:
        element = line.split(" ")[8]
        if element in column8b:
            oFile.write(line + " " + element + "\n")
        ## Delete this if you do not want to write A into C
        ## when there is no match between A[9] and B[8]
        else:
            oFile.write(line + "\n")
Sample Data:
aa.txt
1 2 3 4 5 6 7 8 16 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 26 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 36 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 46 10 11 12 13 14 15 16
bb.txt
1 2 3 4 5 6 7 16 9 10 11 12 13 14
1 2 3 4 5 6 7 36 9 10 11 12 13 14
1 2 3 4 5 6 7 8 9 10 11 12 13 14
cc.txt
1 2 3 4 5 6 7 8 16 10 11 12 13 14 15 16 16
1 2 3 4 5 6 7 8 26 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 36 10 11 12 13 14 15 16 36
1 2 3 4 5 6 7 8 46 10 11 12 13 14 15 16
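For large files, looking up each value in a list gets slow, so a set is a natural substitute. Here is a minimal sketch of the same logic working on lists of lines rather than files, so it can be checked against the sample data above; the function name `append_matches` and the `keep_unmatched` flag are assumptions for illustration:

```python
def append_matches(a_lines, b_lines, keep_unmatched=True):
    """Append column 9 of each A line when it also appears in column 8 of B."""
    col8_b = {line.split()[7] for line in b_lines}  # set membership is O(1) on average
    out = []
    for line in a_lines:
        key = line.split()[8]  # index 8 is column 9
        if key in col8_b:
            out.append(line + " " + key)
        elif keep_unmatched:
            out.append(line)
    return out


a_lines = ["1 2 3 4 5 6 7 8 16 10 11 12 13 14 15 16",
           "1 2 3 4 5 6 7 8 26 10 11 12 13 14 15 16",
           "1 2 3 4 5 6 7 8 36 10 11 12 13 14 15 16",
           "1 2 3 4 5 6 7 8 46 10 11 12 13 14 15 16"]
b_lines = ["1 2 3 4 5 6 7 16 9 10 11 12 13 14",
           "1 2 3 4 5 6 7 36 9 10 11 12 13 14",
           "1 2 3 4 5 6 7 8 9 10 11 12 13 14"]
c_lines = append_matches(a_lines, b_lines)
print("\n".join(c_lines))
```

Passing `keep_unmatched=False` corresponds to deleting the `else` branch in the code above.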
If you read in the file line by line, then you can pull out the relevant information you want.
your_file_A = open("FILEPATH.EXTENSION")
your_file_B = open("FILEPATH.EXTENSION")
your_file_C = open("FILEPATH.EXTENSION", 'w')
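col8_of_B=[]
for line in your_file_B:
    col8_of_B.append(line.split()[7]) # index 7 is column 8
for line in your_file_A:
    # Split on whitespace first; line[8] alone would be the 9th character, not the 9th column
    if line.split()[8] in col8_of_B:
        your_file_C.write(line)
your_file_A.close()
your_file_B.close()
your_file_C.close()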
What about awk (since you have the bash tag)?:
awk 'FNR==NR {b[$8]=$0;next} b[$9] {print $0,$9 }' b a > c