Is there a z.fill function in R? - python

I have a data frame
df=data.frame(f=c('a','ab','abc'),v=1:3)
and make a new column with:
df$c=paste(df$v,df$f,sep='')
the result is
> df
    f v    c
1   a 1   1a
2  ab 2  2ab
3 abc 3 3abc
I would like column c to be in this format:
> df
    f v    c
1   a 1 1  a
2  ab 2 2 ab
3 abc 3 3abc
such that the total length of the concatenated value is fixed (in this case 4 characters), padded with a chosen character, such as | (in this case whitespace).
Is there a function like this in R? I think it is similar to the zfill function in Python, but I am not a Python programmer and would prefer to stay in R rather than switching between languages for processing. Ultimately, I am creating a supervariable of 10 columns, and I think this would help in downstream processing.
I guess it would be in the paste function, but I'm not sure how to 'fill a factor' so that it is of a fixed width.
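For reference, the Python behaviour I have in mind (a quick sketch: zfill pads with zeros, while rjust pads to a fixed width with a chosen character):
# Python reference for the padding behaviour described above
"42".zfill(4)       # '0042'  (zero-padded to width 4)
"1a".rjust(4)       # '  1a'  (space-padded to width 4)
"1a".rjust(4, "|")  # '||1a'  (padded with a chosen character)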

You can use the format() function to pretty-print the values of your column. For example:
> format(df$f, width = 3, justify = "right")
[1] "  a" " ab" "abc"
So your code should be:
df <- within(df, {
  c <- paste0(v, format(f, width = 3, justify = "right"))
})
df
The result:
> df
    f v    c
1   a 1 1  a
2  ab 2 2 ab
3 abc 3 3abc

You can use the formatC function as follows:
df$c <- paste(df$v, formatC(as.character(df$f), width = 3, flag = " "), sep = "")
df
    f v    c
1   a 1 1  a
2  ab 2 2 ab
3 abc 3 3abc
DATA
df <- data.frame(f = c('a','ab','abc'), v=1:3)

Related

Is there a function to write certain values of a dataframe to a .txt file in Python?

I have a dataframe as follows:
Index  A  B  C  D  E  F
1      0  0  C  0  E  0
2      A  0  0  0  0  F
3      0  0  0  0  E  0
4      0  0  C  D  0  0
5      A  B  0  0  0  0
Basically I would like to write the dataframe to a txt file, such that every row consists of the index followed by only the column names of the non-zero entries.
For example:
txt file
1 C E
2 A F
3 E
4 C D
5 A B
The dataset is quite big, about 1k rows, 16k columns. Is there any way I can do this using a function in Pandas?
Take a matrix-vector multiplication between the boolean matrix generated by "is this entry "0" or not" and the column names of the dataframe, and write it to a text file with to_csv (thanks to @Andreas' answer!):
df.ne("0").dot(df.columns + " ").str.rstrip().to_csv("text_file.txt")
where we right-strip the trailing spaces that the appended " " leaves on the last entries.
If you don't want the name Index appearing in the text file, you can chain a rename_axis(index=None) to get rid of it i.e.,
df.ne("0").dot(df.columns + " ").str.rstrip().rename_axis(index=None)
and then to_csv as above.
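If the one-liner looks opaque, here is a step-by-step sketch on a tiny made-up frame (the values are hypothetical, not from the question):
import pandas as pd

df = pd.DataFrame({"A": ["A", "0"], "B": ["0", "B"], "C": ["C", "0"]})
mask = df.ne("0")            # boolean matrix: True where the entry is kept
labels = df.columns + " "    # Index(['A ', 'B ', 'C '])
# the dot product sums mask_ij * label_j per row; True * 'A ' is 'A ', False * 'A ' is ''
print(mask.dot(labels).str.rstrip())
# 0    A C
# 1      B
# dtype: object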
You can try this (replace '0' with 0 if those are numeric 0s instead of string '0's):
# Credits to Pygirl, who made the code even better.
import numpy as np

df.set_index('Index', inplace=True)
df = df.replace('0', np.nan)
df.stack().groupby(level=0).apply(list)
# Out[79]:
# variable
# 0    [C, E]
# 1    [A, F]
# 2       [E]
# 3    [C, D]
# 4    [A, B]
# Name: value, dtype: object
For the writing to text, you can use pandas as well:
df.to_csv('your_text_file.txt')
You could replace string '0' with empty string '', then do some string-list-join manipulation to get the final result. Finally, append each line to a text file. See code:
import pandas as pd

df = pd.DataFrame([
    ['0', '0', 'C', '0', 'E', '0'],
    ['A', '0', '0', '0', '0', 'F'],
    ['0', '0', '0', '0', 'E', '0'],
    ['0', '0', 'C', 'D', '0', '0'],
    ['A', 'B', '0', '0', '0', '0']], columns=['A', 'B', 'C', 'D', 'E', 'F'])
df = df.replace('0', '')

logfile = open('test.txt', 'a')
for i in range(len(df)):
    temp = ''.join(list(df.loc[i, :]))
    logfile.write(str(i+1) + ' ' + ' '.join(list(temp)) + '\n')
logfile.close()
Output test.txt
1 C E
2 A F
3 E
4 C D
5 A B

How do I determine how many characters in common two pandas columns have?

I have a dataframe with two columns. I want to know how many characters they have in common. The number of common characters should be a new column. Here's a minimal reproducible example.
What I have:
import pandas as pd
from string import ascii_lowercase
import numpy as np
df = pd.DataFrame([[''.join(np.random.choice(list(ascii_lowercase), 8))
                    for i in range(10)] for i in range(2)],
                  index=['col_1', 'col_2']).T
Out[17]:
      col_1     col_2
0  ollcgfmy  daeubsrx
1  jtvtqoux  xbgtrzno
2  irwmoqqa  mdblczfa
3  jyebzpyd  xwlynkhw
4  ifuqojvs  lxotbsju
5  fybsqbku  xwbluaek
6  oylztnpf  gelonsay
7  zdkibutk  ujlcwhfu
8  uhrcjbsk  nhxhpoii
9  eocxreqz  muvfwusi
What I need (the numbers are random):
Out[19]:
      col_1     col_2  common_letters
0  ollcgfmy  daeubsrx               1
1  jtvtqoux  xbgtrzno               1
2  irwmoqqa  mdblczfa               0
3  jyebzpyd  xwlynkhw               3
4  ifuqojvs  lxotbsju               3
5  fybsqbku  xwbluaek               3
6  oylztnpf  gelonsay               3
7  zdkibutk  ujlcwhfu               3
8  uhrcjbsk  nhxhpoii               1
9  eocxreqz  muvfwusi               3
EDIT: to anyone reading this trying to get the similarity between two strings: don't use this approach. Other similarity measures exist, such as Levenshtein or Jaccard.
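For example, a minimal sketch of Jaccard similarity over the character sets of two strings (just for reference, not one of the answers below):
def jaccard(a, b):
    # |intersection| / |union| of the two character sets
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

jaccard("ollcgfmy", "daeubsrx")  # 0.0 -> no characters in common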
Using df.apply and set operations can be one way to solve the problem:
df["common_letters"] = df.apply(
lambda x: len(set(x["col_1"]).intersection(set(x["col_2"]))),
axis=1)
output:
      col_1     col_2  common_letters
0  cgeabfem  amnwfsde               4
1  vozgpmgs  slfwvjnv               2
2  xyvktrfr  jtzijmud               1
3  piexmmgh  ydaxbmyo               2
4  iydpnwcu  hhdxyptd               3
If you like sets you can go for:
df['common_letters'] = (df.col_1.apply(set).apply(len)
                        + df.col_2.apply(set).apply(len)
                        - (df.col_1 + df.col_2).apply(set).apply(len))
You can use numpy:
df["noCommonChars"]=np.bitwise_and(df["col_1"].map(set), df["col_2"].map(set)).str.len()
Output:
      col_1     col_2  noCommonChars
0  smuxucyw  hywtedvz              2
1  bniuqhkh  axcuukjg              2
2  ttzehrtl  nbmsmwsc              0
3  ndwyjusu  dssmdnvb              3
4  zqvsvych  wguthcwu              2
5  jlnpjqgn  xgedmodm              1
6  ocjbtnpy  lywjqkjf              2
7  tolrpshi  hslxxmgo              4
8  ehatmryw  fhpvluvq              1
9  icciebte  joyiwooi              1
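This appears to work because np.bitwise_and falls back to Python's & operator for object arrays, and & on two sets is set intersection:
# set & set is intersection, which np.bitwise_and applies elementwise here
{'a', 'b', 'c'} & {'b', 'c', 'd'}   # {'b', 'c'}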
Edit
In order to include repeated characters, you can do:
from collections import Counter

df["common_letters_full"] = np.bitwise_and(df["col_1"].map(Counter), df["col_2"].map(Counter))
df["common_letters"] = df["common_letters_full"].map(dict.values).apply(sum)
# alternatively:
df["common_letters"] = df["common_letters_full"].apply(pd.Series).sum(axis=1)
I already see better answers :D, but here goes a kind-of-good one. I might have been able to use more from pandas. I took some code from here:
import pandas as pd
from string import ascii_lowercase
import numpy as np

def countPairs(s1, s2):
    n1 = len(s1)
    n2 = len(s2)
    # To store the frequencies of characters
    # of strings s1 and s2
    freq1 = [0] * 26
    freq2 = [0] * 26
    # To store the count of valid pairs
    count = 0
    # Update the frequencies of
    # the characters of string s1
    for i in range(n1):
        freq1[ord(s1[i]) - ord('a')] += 1
    # Update the frequencies of
    # the characters of string s2
    for i in range(n2):
        freq2[ord(s2[i]) - ord('a')] += 1
    # Find the count of valid pairs
    for i in range(26):
        count += min(freq1[i], freq2[i])
    return count
# This code is contributed by Ryuga

df = pd.DataFrame([[''.join(np.random.choice(list(ascii_lowercase), 8))
                    for i in range(10)] for i in range(2)],
                  index=['col_1', 'col_2']).T

counts = []
for i in range(0, df.shape[0]):
    counts.append(countPairs(df.iloc[i].col_1, df.iloc[i].col_2))
df["counts"] = counts
      col_1     col_2  counts
0  ploatffk  dwenjpmc       1
1  gjjupyqg  smqtlmzc       1
2  cgtxexho  hvwhpyfh       1
3  mifsbfhc  ufalhlbi       4
4  qnjesfdn  lyhrrnkf       2
5  omnumzmf  dagttzqo       2
6  gsygkrrb  aocfoqxk       1
7  wrgvruuw  ydnlzvyf       1
8  ivkdxoft  zmgcnrjr       0
9  vvthbzjj  mmirlcvx       1

Find longest run of consecutive zeros for each user in dataframe

I'm looking to find the max run of consecutive zeros in a DataFrame with the result grouped by user. I'm interested in running the RLE on usage.
sample input:
user--day--usage
A-----1------0
A-----2------0
A-----3------1
B-----1------0
B-----2------1
B-----3------0
Desired output
user---longest_run
a - - - - 2
b - - - - 1
mydata <- mydata[order(mydata$user, mydata$day),]
user <- unique(mydata$user)
d2 <- data.frame(matrix(NA, ncol = 2, nrow = length(user)))
names(d2) <- c("user", "longest_no_usage")
d2$user <- user
for (i in user) {
  if (0 %in% mydata$usage[mydata$user == i]) {
    run <- rle(mydata$usage[mydata$user == i]) # Run Length Encoding
    d2$longest_no_usage[d2$user == i] <- max(run$length[run$values == 0])
  } else {
    d2$longest_no_usage[d2$user == i] <- 0 # some users did not have no-usage days
  }
}
d2 <- d2[order(-d2$longest_no_usage),]
This works in R, but I want to do the same thing in Python, and I'm totally stumped.
First, use groupby with size on the user column, the usage column, and a helper Series that labels runs of consecutive values:
print (df)
  user  day  usage
0    A    1      0
1    A    2      0
2    A    3      1
3    B    1      0
4    B    2      1
5    B    3      0
6    C    1      1
df1 = (df.groupby([df['user'],
                   df['usage'].rename('val'),
                   df['usage'].ne(df['usage'].shift()).cumsum()])
         .size()
         .to_frame(name='longest_run'))
print (df1)
                longest_run
user val usage
A    0   1                2
     1   2                1
B    0   3                1
         5                1
     1   4                1
C    1   6                1
Then filter only the zero rows, get the max per user, and reindex to append the users with no zero runs:
df2 = (df1.query('val == 0')
          .max(level=0)
          .reindex(df['user'].unique(), fill_value=0)
          .reset_index())
print (df2)
  user  longest_run
0    A            2
1    B            1
2    C            0
Detail:
print (df['usage'].ne(df['usage'].shift()).cumsum())
0    1
1    1
2    2
3    3
4    4
5    5
6    6
Name: usage, dtype: int32
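The same run-labelling trick on a standalone Series, if it helps to see it in isolation (a small sketch, not part of the answer above):
import pandas as pd

s = pd.Series([0, 0, 1, 0, 1, 0])
run_id = s.ne(s.shift()).cumsum()   # a new label every time the value changes
# count zeros per run and take the largest count
print(s.eq(0).groupby(run_id).sum().max())   # 2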
Get the max number of consecutive zeros on a series:
def max0(sr):
    # label each position with the count of non-zero values seen so far;
    # a run of zeros shares one label, so the most frequent label marks it
    counts = (sr != 0).cumsum().value_counts()
    # each label (except the leading label 0) also covers the non-zero
    # element that started it, so subtract 1 in that case
    return counts.max() - (0 if counts.idxmax() == 0 else 1)
max0(pd.Series([1,0,0,0,0,2,3]))
4
I think the following does what you are looking for, where the consecutive_zero function is an adaptation of the top answer here.
Hope this helps!
import pandas as pd
from itertools import groupby
df = pd.DataFrame([['A', 1], ['A', 0], ['A', 0], ['B', 0], ['B', 1], ['C', 2]],
                  columns=["user", "usage"])

def len_iter(items):
    return sum(1 for _ in items)

def consecutive_zero(data):
    x = list(len_iter(run) for val, run in groupby(data) if val == 0)
    if len(x) == 0:
        return 0
    else:
        return max(x)
df.groupby('user').apply(lambda x: consecutive_zero(x['usage']))
Output:
user
A    2
B    1
C    0
dtype: int64
If you have a large dataset and speed is essential, you might want to try the high-performance pyrle library.
Setup:
# pip install pyrle
# or
# conda install -c bioconda pyrle
import numpy as np
np.random.seed(0)
import pandas as pd
from pyrle import Rle
size = int(1e7)
number = np.random.randint(2, size=size)
user = np.random.randint(5, size=size)
df = pd.DataFrame({"User": np.sort(user), "Number": number})
df
#          User  Number
# 0           0       0
# 1           0       1
# 2           0       1
# 3           0       0
# 4           0       1
# ...       ...     ...
# 9999995     4       1
# 9999996     4       1
# 9999997     4       0
# 9999998     4       0
# 9999999     4       1
#
# [10000000 rows x 2 columns]
Execution:
for u, udf in df.groupby("User"):
    r = Rle(udf.Number)
    is_0 = r.values == 0
    print("User", u, "Max", np.max(r.runs[is_0]))
# (Wall time: 1.41 s)
# User 0 Max 20
# User 1 Max 23
# User 2 Max 20
# User 3 Max 22
# User 4 Max 23
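If you would rather collect the results in a frame than print them, a small sketch using the same Rle objects as above:
rows = []
for u, udf in df.groupby("User"):
    r = Rle(udf.Number)
    rows.append((u, r.runs[r.values == 0].max()))
out = pd.DataFrame(rows, columns=["User", "longest_run"])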

Fix columns indentation with python

There is a file format called .xyz that helps visualizing molecular bonds. Basically the format asks for a specific pattern:
At the first line there must be the number of atoms, which in my case is 30.
After that comes the data, where the first column is the name of the atom (in my case they are all carbon), the second column is the x information, the third column is the y information, and the last column is the z information, which is all 0 in my case. The indentation should be correct, so that all of the corresponding columns start at the same place. So something like this:
30
C    x1    y1    z1
C    x2    y2    z2
...
...
...
and not:
30
C x1 y1 z1
  C x2 y2 z2
since this is the wrong indentation.
My generated data is stored like this in a .txt file:
C 2.99996 7.31001e-05 0
C 2.93478 0.623697 0
C 2.74092 1.22011 0
C 2.42702 1.76343 0
C 2.0079 2.22961 0
C 1.50006 2.59812 0
C 0.927076 2.8532 0
C 0.313848 2.98349 0
C -0.313623 2.9837 0
C -0.927229 2.85319 0
C -1.5003 2.5981 0
C -2.00732 2.22951 0
C -2.42686 1.76331 0
C -2.74119 1.22029 0
C -2.93437 0.623802 0
C -2.99992 -5.5509e-05 0
C -2.93416 -0.623574 0
C -2.7409 -1.22022 0
C -2.42726 -1.7634 0
C -2.00723 -2.22941 0
C -1.49985 -2.59809 0
C -0.92683 -2.85314 0
C -0.313899 -2.98358 0
C 0.31363 -2.98356 0
C 0.927096 -2.85308 0
C 1.50005 -2.59792 0
C 2.00734 -2.22953 0
C 2.4273 -1.76339 0
C 2.74031 -1.22035 0
C 2.93441 -0.623647 0
I want to correct the indentation of this by making all of the lines start from the same point. I tried to do this with AWK to no avail. So I turned to Python. So far I have this:
#!/usr/bin/env python
text_file = open("output.txt", "r")
lines = text_file.readlines()
myfile = open("output.xyz", "w")
for line in lines:
    atom, x, y, z = line.split()
    x, y, z = map(float, (x, y, z))
    myfile.write("{}\t {}\t {}\t {}\n".format(atom, x, y, z))
myfile.close()
text_file.close()
but I currently don't know how the indentation can be added to this.
tl;dr: I have a data file in .txt and I want to change it into the .xyz format specified above, but I am running into problems with indentation.
It appears that I misinterpreted your requirement...
To achieve a fixed width output using awk, you could use printf with a format string like this:
$ awk '{printf "%-4s%12.6f%12.6f%5d\n", $1, $2, $3, $4}' data.txt
C       2.999960    0.000073    0
C       2.934780    0.623697    0
C       2.740920    1.220110    0
C       2.427020    1.763430    0
C       2.007900    2.229610    0
C       1.500060    2.598120    0
C       0.927076    2.853200    0
C       0.313848    2.983490    0
C      -0.313623    2.983700    0
# etc.
Numbers after the % specify the width of the field. A negative number means that the output should be left aligned (as in the first column). I have specified 6 decimal places for the floating point numbers.
Original answer, in case it is useful:
To ensure that there is a tab character between each of the columns of your input, you could use this awk script:
awk '{$1=$1}1' OFS="\t" data.txt > output.xyz
$1=$1 just forces awk to touch each line, which makes sure that the new Output Field Separator (OFS) is applied.
awk scripts are built up from a series of condition { action }. If no condition is supplied, the action is performed for every line. If a condition but no action is supplied, the default action is to print the line. 1 is a condition that always evaluates to true, so awk prints the line.
Note that even though the columns are all tab-separated, they are still not lined up because the content of each column is of a variable length.
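If you would rather not hard-code the widths, one option is to measure them first. A sketch in Python, assuming the input file is called data.txt:
# measure the widest entry per column, then right-justify every field to it
with open("data.txt") as fh:
    rows = [line.split() for line in fh]

widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
with open("output.xyz", "w") as out:
    for row in rows:
        out.write("  ".join(f.rjust(w) for f, w in zip(row, widths)) + "\n")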
Your data has already been ill-formatted and converted to strings. To correctly align the numeric and non-numeric data, you need to parse the individual fields into their respective data types (possibly using duck typing) before formatting with str.format:
def convert(st):
    # duck typing: try int, then float, then fall back to the raw string
    try:
        return int(st)
    except ValueError:
        pass
    try:
        return float(st)
    except ValueError:
        pass
    return st

# st holds the contents of the data file as one string
for line in st.splitlines():
    print "{:8}{:12.5f}{:12.5f}{:5d}".format(*map(convert, line.split()))
C            2.99996     0.00007    0
C            2.93478     0.62370    0
C            2.74092     1.22011    0
C            2.42702     1.76343    0
C            2.00790     2.22961    0
C            1.50006     2.59812    0
C            0.92708     2.85320    0
C            0.31385     2.98349    0
C           -0.31362     2.98370    0
C           -0.92723     2.85319    0
Using this: awk '{printf "%s\t%10f\t%10f\t%i\n",$1,$2,$3,$4}' atoms
gives this output:
C      2.999960    0.000073  0
C      2.934780    0.623697  0
C      2.740920    1.220110  0
C      2.427020    1.763430  0
C      2.007900    2.229610  0
C      1.500060    2.598120  0
C      0.927076    2.853200  0
C      0.313848    2.983490  0
C     -0.313623    2.983700  0
C     -0.927229    2.853190  0
C     -1.500300    2.598100  0
C     -2.007320    2.229510  0
C     -2.426860    1.763310  0
C     -2.741190    1.220290  0
C     -2.934370    0.623802  0
C     -2.999920   -0.000056  0
C     -2.934160   -0.623574  0
C     -2.740900   -1.220220  0
C     -2.427260   -1.763400  0
C     -2.007230   -2.229410  0
C     -1.499850   -2.598090  0
C     -0.926830   -2.853140  0
C     -0.313899   -2.983580  0
C      0.313630   -2.983560  0
C      0.927096   -2.853080  0
C      1.500050   -2.597920  0
C      2.007340   -2.229530  0
C      2.427300   -1.763390  0
C      2.740310   -1.220350  0
C      2.934410   -0.623647  0
Is this what you meant, or did I misunderstand?
Edit, side note: I used tabs (\t) for separation, though a space would do too. I limited the output fields to a width of 10, and I didn't verify your input length.
You can use string formatting to print values with consistent padding. For your case, you might write lines like this to the file:
>>> '%-12s %-12s %-12s %-12s\n' % ('C', '2.99996', '7.31001e-05', '0')
'C            2.99996      7.31001e-05  0           \n'
"%-12s" means "take the str() of the value and make it take up at least 12 characters left-justified.

Merge rows based on values and key string in awk, sed or python

I have the following input table:
1 2 A "aaa"
3 4 A "aaa"
5 6 A "aaa"
1 2 B "bbb"
3 4 B "bbb"
1 2 A "ccc"
I'd like to get:
output1 - from input print lowest and highest values from column 1 and column 2, respectively, with the same name in column 4
1 6 A "aaa"
1 4 B "bbb"
1 2 A "ccc"
output2 - from input print the values in columns 1 and 2 'between the rows'; take the value from column 2 (row 1) and column 1 (row 2) into a new row with the same name in column 4 (skip when the name in column 4 changes, as in rows 3, 5, and 6 of the input).
2 3 A "aaa"
4 5 A "aaa"
2 3 B "bbb"
I'd really appreciate your advice.
Thanks in advance!
Here is one way to do part #1 with awk
awk '!b[$3" "$4]||b[$3" "$4]>$1 {b[$3" "$4]=$1} !t[$3" "$4]||t[$3" "$4]<$2 {t[$3" "$4]=$2} END {for (i in b) print b[i],t[i],i}' file
1 2 A "ccc"
1 6 A "aaa"
1 4 B "bbb"
If column #3 is always connected to column #4
awk '!b[$4]||b[$4]>$1 {b[$4]=$1} !t[$4]||t[$4]<$2 {t[$4]=$2} {z[$4]=$3} END {for (i in b) print b[i],t[i],z[i],i}' file
1 6 A "aaa"
1 2 A "ccc"
1 4 B "bbb"
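A compact pandas alternative for part #1, assuming the whitespace-separated input sits in a file named file (the column names below are made up):
import pandas as pd

df = pd.read_csv("file", sep=r"\s+", header=None,
                 names=["low", "high", "key", "name"])
out = (df.groupby(["key", "name"], as_index=False)
         .agg(low=("low", "min"), high=("high", "max")))
print(out[["low", "high", "key", "name"]].to_string(index=False))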
In Python you can try the following solution. I edited it so that it accepts non-consecutive numbers for the indices as well.
# -*- encoding: utf-8 -*-
def get_min_max_index(data):
    result = dict()
    names = set([record[3] for record in data])
    for name in names:
        name_records = filter(lambda record: record[3] == name, data)
        name_indices = map(lambda record: (record[0], record[1]), name_records)
        record_id = name_records[0][2]
        result[name] = (min(name_indices)[0], max(name_indices)[1], record_id, name_indices)
    return result

def get_between_rows(data):
    records_min_max = get_min_max_index(data)
    result = list()
    for i in range(len(data) - 1):
        name = data[i][3]
        max_ind = records_min_max[name][1]
        if data[i][1] < max_ind:
            result.append([data[i][1], data[i+1][0], data[i][2], data[i][3]])
    return result

if __name__ == "__main__":
    import sys
    data = list()
    for line in sys.stdin.readlines():
        line = line.strip().split()
        data.append([int(line[0]), int(line[1]), line[2], line[3].strip('"')])
    for name, line in get_min_max_index(data).items():
        print('{0} {1} {2} {3}'.format(line[0], line[1], line[2], name))
    print('\n')
    for line in get_between_rows(data):
        print('{0} {1} {2} {3}'.format(line[0], line[1], line[2], line[3]))
# vim:expandtab:smartindent:tabstop=4:softtabstop=4:shiftwidth=4:
Here is the result of the command cat linked.txt | python linked.py
1 6 A aaa
1 4 B bbb
1 2 A ccc
2 3 A aaa
4 5 A aaa
2 3 B bbb
