Python: deleting the first n terms in a string - python

I have a .txt file where I want to delete the first 7 characters (spaces included) from every line. I've tried the following:
with open('input_nnn.txt', 'r') as input:
    with open('input_nnnn.txt', 'a') as output:
        for line in input:
            output.write(line[6])
            output.write('\n')
However, I end up with this error:
File "spinf.py", line 4, in <module>
out.write(line[6])
IndexError: string index out of range
To make my question clearer, let's say I have a file that looks like this:
1 z 3 4 5 a 7 seven 8 9 0 11 2
1 z 3 4 5 a 7 seven 8 9 0 11 2
1 z 3 4 5 a 7 seven 8 9 0 11 2
1 z 3 4 5 a 7 seven 8 9 0 11 2
1 z 3 4 5 a 7 seven 8 9 0 11 2
1 z 3 4 5 a 7 seven 8 9 0 11 2
I'd want my output to look like this:
5 a 7 seven 8 9 0 11 2
5 a 7 seven 8 9 0 11 2
5 a 7 seven 8 9 0 11 2
5 a 7 seven 8 9 0 11 2
5 a 7 seven 8 9 0 11 2
5 a 7 seven 8 9 0 11 2

The error indicates that at least one line in your file has fewer than 7 characters, so line[6] (the 7th character) is out of range.
Adding a check on the length of the string before indexing is a good idea.
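A minimal sketch of that length check, using made-up lines (note that slicing with line[7:], unlike indexing with line[6], never raises IndexError):

```python
# Guard short lines before stripping the first 7 characters.
src_lines = ["1 z 3 4 5 a 7 seven 8 9 0 11 2\n", "short\n"]
out_lines = []
for line in src_lines:
    # Keep everything after the first 7 characters; lines that are
    # too short become blank lines instead of raising an error.
    out_lines.append(line[7:] if len(line.rstrip("\n")) > 7 else "\n")
print(out_lines)
```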

Reading and writing simultaneously to the same file is not going to work well in Python (at least with the standard library), because you only have low-level control of things. Your code is going to look really weird and will probably have bugs.
import os

with open('input_nnn.txt', 'r') as original_file:
    with open('input_nnn.txt.new', 'w') as new_file:
        for line in original_file:
            # line keeps its trailing '\n'; write '\n' if the slice is empty
            new_file.write(line[7:] or '\n')
os.replace('input_nnn.txt.new', 'input_nnn.txt')
You could also do this directly from bash using cut (-c8- keeps characters from position 8 on, i.e. drops the first 7):
cut -c8- <file1.txt >file1.txt.new
mv file1.txt.new file1.txt
Also, don't use input as a variable name; it shadows the built-in input() function.

Related

Filling data in Pandas

I'm using pandas and I have a little data like this:
4 1
5 8
6 25
7 33
8 24
9 4
and I want to fill in the missing parts, like this:
1 0
2 0
3 0
4 1
5 8
6 25
7 33
8 24
9 4
10 0
It's going to be used as a list, like this: [0,0,0,1,8,25,33,24,4,0]
I looked for a solution but couldn't find any. Any idea?
Try with reindex (here s is your data as a Series indexed by the first column; note the range starts at 1 to match your desired output):
l = s.reindex(range(1, 10+1), fill_value=0).tolist()
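A self-contained sketch, assuming the data is a Series s indexed by the first column:

```python
import pandas as pd

# The question's data: values keyed by index 4..9.
s = pd.Series([1, 8, 25, 33, 24, 4], index=[4, 5, 6, 7, 8, 9])

# Reindex over 1..10, filling the missing positions with 0.
l = s.reindex(range(1, 10 + 1), fill_value=0).tolist()
print(l)  # [0, 0, 0, 1, 8, 25, 33, 24, 4, 0]
```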

How to read a text file into a data frame and split the values?

I have this data set in an Excel file. I want to keep only the values that have length 6, delete the rest, and export the result with each single value split into a separate column.
Please tell me if there is a function to read the file and split the numeric values.
From your shared data it seems there are spaces between the numbers, so they will already be read in as str.
You can try the code below.
Your df looks like this:
a
0 11
1 2
2 3 2 4
3 5
4 1
5 6
6 1 1
7 6
8 6 7 7 7 6 6 8 8 8
9 6 8 7 9 5 2 1 44 6 55
10 6 8 7 9 5 2 1 44 6 55 4 4 4 4
Filter the rows with length equal to 6:
df = df[df['a'].str.len() == 6]
Then split them using the split() method like this:
df['a'].str.split(" ", expand=True)
output:
0 1 2 3
2 3 2 4
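The filter-and-split steps above can be sketched end to end with made-up values (the strings here are hypothetical, chosen so exactly one has length 6):

```python
import pandas as pd

df = pd.DataFrame({'a': ['11', '12 345', '6 8 7 9 5 2']})
df = df[df['a'].str.len() == 6]           # keep only strings of length 6
out = df['a'].str.split(' ', expand=True) # one column per space-separated value
print(out.values.tolist())  # [['12', '345']]
```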
EDIT:
If you run into memory trouble while reading a large file, you can refer to this SO post,
OR read the file in chunks and append/save the output to a new file:
reader = pd.read_csv(filePath, chunksize=1000000, low_memory=False, header=0)
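The chunked reading can be sketched with an in-memory buffer standing in for the large file (the buffer and chunk size here are made up for illustration):

```python
import io
import pandas as pd

# A tiny "file" with a header row and 10 data rows.
buf = io.StringIO("a\n" + "\n".join(str(i) for i in range(10)))

# Iterating the reader yields DataFrames of at most `chunksize` rows each.
reader = pd.read_csv(buf, chunksize=4, header=0)
total = sum(len(chunk) for chunk in reader)
print(total)  # 10
```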

Remove parentheses and their contents if present in a df column

I have a dataframe where the top scores/instances have parentheses. I would like to remove the parentheses and leave only the number. How would I do so?
I have tried the code below, but it leaves me with NaNs for all the other numbers that do not have parentheses.
.str.replace(r"\(.*\)","")
This is what the columns look like:
0 1(1P)
1 3(3P)
2 2(2P)
3 4(RU)
4 5(RU)
5 6(RU)
6 8
7 7
8 11
9 13
I want clean columns with only numbers.
Thanks!
The reason is mixed values (numeric mixed with strings); a possible solution is:
df['a'] = df['a'].astype(str).str.replace(r"\(.*\)", "", regex=True).astype(int)
print (df)
a
0 1
1 3
2 2
3 4
4 5
5 6
6 8
7 7
8 11
9 13
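A self-contained sketch of the fix, using a few of the sample values from the question (note regex=True, which recent pandas versions require for pattern replacement):

```python
import pandas as pd

# Mixed column: strings with parentheses plus plain integers.
df = pd.DataFrame({'a': ['1(1P)', '3(3P)', '4(RU)', 8, 11]})

# Cast everything to str, strip "(...)", then cast back to int.
df['a'] = df['a'].astype(str).str.replace(r"\(.*\)", "", regex=True).astype(int)
print(df['a'].tolist())  # [1, 3, 4, 8, 11]
```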

Pandas - Randomly Replace 10% of rows with other rows

I want to randomly select 10% of all rows in my df and replace each with a randomly sampled existing row from the df.
To randomly select 10% of the rows, rows_to_change = df.sample(frac=0.1) works, and I can get a new random existing row with replacement_sample = df.sample(n=1), but how do I put this together to quickly iterate over the entire 10%?
The df contains millions of rows x ~100 cols.
Example df:
df = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],'B':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],'C':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})
A B C
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 14 14 14
14 15 15 15
Let's say it randomly samples indexes 2,13 to replace with randomly selected indexes 6,9 the final df would look like:
A B C
0 1 1 1
1 2 2 2
2 7 7 7
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 10 10 10
14 15 15 15
You can take a random sample, then take another random sample of the same size and replace the values at those indices with the original sample.
import pandas as pd
df = pd.DataFrame({'A': range(1, 16), 'B': range(1, 16), 'C': range(1, 16)})
samp = df.sample(frac=0.1)
samp
# returns:
A B C
6 7 7 7
9 10 10 10
replace = df.loc[~df.index.isin(samp.index)].sample(samp.shape[0])
replace
# returns:
A B C
3 4 4 4
7 8 8 8
df.loc[replace.index] = samp.values
This copies the rows without replacement
df
# returns:
A B C
0 1 1 1
1 2 2 2
2 3 3 3
3 7 7 7
4 5 5 5
5 6 6 6
6 7 7 7
7 10 10 10
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 14 14 14
14 15 15 15
To sample with replacement, use the keyword replace=True when defining samp.
@James' answer is a smart pandas solution. However, given that you noted your dataset length is somewhere in the millions, you could also consider NumPy, given that pandas often comes with significant performance overhead.
import numpy as np
import pandas as pd

def repl_rows(df: pd.DataFrame, pct: float):
    # Modifies `df` in place.
    n, _ = df.shape
    rows = int(2 * np.ceil(n * pct))    # total rows in both sets
    idx = np.arange(n, dtype=np.int64)  # dtype agnostic
    full = np.random.choice(idx, size=rows, replace=False)
    to_repl, repl_with = np.split(full, 2)
    df.values[to_repl] = df.values[repl_with]
Steps:
Get target rows as an integer.
Get a NumPy range-array the same length as your index. Might provide more stability than using the index itself if you have something like an uneven datetime index. (I'm not totally sure, something to toy around with.)
Sample from this index without replacement, sample size is 2 times the number of rows you want to manipulate.
Split the result in half to get targets and replacements. Should be faster than two calls to choice().
Replace at positions to_repl with values from repl_with.
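A usage sketch of the NumPy approach above; the frame and seed here are made up, and the final assignment uses .iloc rather than .values so it also works when pandas copy-on-write is enabled:

```python
import numpy as np
import pandas as pd

def repl_rows(df, pct):
    n = len(df)
    rows = int(2 * np.ceil(n * pct))  # targets + replacements combined
    # One draw without replacement, split into disjoint target/source halves.
    full = np.random.choice(n, size=rows, replace=False)
    to_repl, repl_with = np.split(full, 2)
    df.iloc[to_repl] = df.iloc[repl_with].to_numpy()

df = pd.DataFrame({'A': range(100), 'B': range(100)})
np.random.seed(0)
repl_rows(df, 0.1)

# Since column A held 100 distinct values, exactly 10 rows now differ.
changed = int((df['A'].to_numpy() != np.arange(100)).sum())
print(changed)  # 10
```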

Looping through specific columns in two separate text files

I have two text files A and B, with 16 and 14 columns respectively.
The columns in these files are separated with spaces.
For each entry in column 9 of file A, I want to check if the entry is in column 8 of file B.
If it is, I would like to add this value to a new file (file C). However, I would like file C to retain the same format as file A, with the matched value appended.
In other words, this new file should contain 17 columns.
I have been unable to figure out how to approach this problem and cannot include my progress as a result. Any help is appreciated.
Thank you in advance.
You can read both files into lists, extract B's 8th column into a list, and then iterate over file A, checking whether its 9th element appears in the list of B's column 8.
If there is a match, the match is appended at the end of that line of A; otherwise line A is written unchanged.
NOTE: if you do not need the line when there is no match then you can delete the else part.
Code
alines = [line.rstrip('\n') for line in open('aa.txt')]
blines = [line.rstrip('\n') for line in open('bb.txt')]

column8b = []
for line in blines:
    column8b.append(line.split(" ")[7])

with open('cc.txt', "w") as oFile:
    for line in alines:
        element = line.split(" ")[8]
        if element in column8b:
            oFile.write(line + " " + element + "\n")
        ## Delete this if you do not want to write A into C
        ## when there is no match between A[9] and B[8]
        else:
            oFile.write(line + "\n")
Sample Data:
aa.txt
1 2 3 4 5 6 7 8 16 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 26 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 36 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 46 10 11 12 13 14 15 16
bb.txt
1 2 3 4 5 6 7 16 9 10 11 12 13 14
1 2 3 4 5 6 7 36 9 10 11 12 13 14
1 2 3 4 5 6 7 8 9 10 11 12 13 14
cc.txt
1 2 3 4 5 6 7 8 16 10 11 12 13 14 15 16 16
1 2 3 4 5 6 7 8 26 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 36 10 11 12 13 14 15 16 36
1 2 3 4 5 6 7 8 46 10 11 12 13 14 15 16
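A variant of the same logic using a set, which makes each membership test O(1) instead of a scan of the list; in-memory strings stand in for aa.txt and bb.txt here:

```python
# Two lines of "file A" and two lines of "file B" from the sample data.
a_text = ("1 2 3 4 5 6 7 8 16 10 11 12 13 14 15 16\n"
          "1 2 3 4 5 6 7 8 26 10 11 12 13 14 15 16")
b_text = ("1 2 3 4 5 6 7 16 9 10 11 12 13 14\n"
          "1 2 3 4 5 6 7 36 9 10 11 12 13 14")

# Collect B's 8th column once, as a set.
col8_b = {line.split()[7] for line in b_text.splitlines()}

c_lines = []
for line in a_text.splitlines():
    element = line.split()[8]  # A's 9th column
    c_lines.append(line + " " + element if element in col8_b else line)
print(c_lines[0])  # 1 2 3 4 5 6 7 8 16 10 11 12 13 14 15 16 16
```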
If you read the file in line by line, you can pull out the relevant information you want.
your_file_A = open("FILEPATH.EXTENSION")
your_file_B = open("FILEPATH.EXTENSION")
your_file_C = open("FILEPATH.EXTENSION", 'w')

col8_of_B = []
for line in your_file_B:
    col8_of_B.append(line.split()[7])  # split()[7] is the 8th column

for line in your_file_A:
    if line.split()[8] in col8_of_B:
        your_file_C.write(line)
What about awk (since you have the bash tag)?:
awk 'FNR==NR {b[$8]=$0;next} b[$9] {print $0,$9 }' b a > c