I'm using pandas and i have a little data like that.
4 1
5 8
6 25
7 33
8 24
9 4
and I want to fill in missing parts. I want to like that :
1 0
2 0
3 0
4 1
5 8
6 25
7 33
8 24
9 4
10 0
It's gonna be a list for use. like that [0,0,0,1,8,25,33,24,4,0]
looked for a solution but couldn't find any. Any idea?
Try with reindex
l = s.reindex(range(10+1),fill_value=0).tolist()
I have this data set in an Excel file. I want to keep the data which have only length 6 and delete rest and export it in the split of single values stored in a separate column.
Please tell me if we have any function to split the numeric values in the file to read it and split
From your shared data it seems it has spaces between numbers so they will already be in str
you can try below code:
your df looks like this:
a
0 11
1 2
2 3 2 4
3 5
4 1
5 6
6 1 1
7 6
8 6 7 7 7 6 6 8 8 8
9 6 8 7 9 5 2 1 44 6 55
10 6 8 7 9 5 2 1 44 6 55 4 4 4 4
filter rows with len equal to 6
df=df[df['a'].str.len()==6]
then split them using split() method like this
df['a'].str.split(" ", expand = True)
output:
0 1 2 3
2 3 2 4
EDIT:
for having trouble with memory while reading a large file you can refer to this SO post
OR read the file in chunks and append/save the output in new file
reader = pd.read_csv(filePath,chunksize=1000000,low_memory=False,header=0)
I have a dataframe where the top scores/instances have parenthesis. I would like to remove the parenthesis and only leave the number. How would I do so?
I have tried the code below, but it leaves me with nans for all other numbers that do not have paranthesis.
.str.replace(r"\(.*\)","")
This is what the columns look like:
0 1(1P)
1 3(3P)
2 2(2P)
3 4(RU)
4 5(RU)
5 6(RU)
6 8
7 7
8 11
9 13
I want clean columns with only numbers.
Thanks!
Reason is mixed values - numeric with strings, possible solution is:
df['a'] = df['a'].astype(str).str.replace(r"\(.*\)","").astype(int)
print (df)
a
0 1
1 3
2 2
3 4
4 5
5 6
6 8
7 7
8 11
9 13
I want to randomly select 10% of all rows in my df and replace each with a randomly sampled existing row from the df.
To randomly select 10% of rows rows_to_change = df.sample(frac=0.1) works and I can get a new random existing row with replacement_sample = df.sample(n=1) but how do I put this together to quickly iterate over the entire 10%?
The df contains millions of rows x ~100 cols.
Example df:
df = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],'B':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],'C':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})
A B C
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 14 14 14
14 15 15 15
Let's say it randomly samples indexes 2,13 to replace with randomly selected indexes 6,9 the final df would look like:
A B C
0 1 1 1
1 2 2 2
2 7 7 7
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 10 10 10
14 15 15 15
You can take a random sample, then take another random sample of the same size and replace the values at those indices with the original sample.
import pandas as pd
df = pd.DataFrame({'A': range(1,15), 'B': range(1,15), 'C': range(1,15)})
samp = df.sample(frac=0.1)
samp
# returns:
A B C
6 7 7 7
9 10 10 10
replace = df.loc[~df.index.isin(samp.index)].sample(samp.shape[0])
replace
# returns:
A B C
3 4 4 4
7 8 8 8
df.loc[replace.index] = samp.values
This copies the rows without replacement
df
# returns:
A B C
0 1 1 1
1 2 2 2
2 3 3 3
3 7 7 7
4 5 5 5
5 6 6 6
6 7 7 7
7 10 10 10
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 14 14 14
14 15 15 15
To sample with replacement, use the keyword replace = True when defining samp
#James' answer is a smart Pandas solution. However, given that you noted your dataset length is somewhere in the millions, you could also consider NumPy given that Pandas often comes with significant performance overhead.
def repl_rows(df: pd.DataFrame, pct: float):
# Modifies `df` inplace.
n, _ = df.shape
rows = int(2 * np.ceil(n * pct)) # Total rows in both sets
idx = np.arange(n, dtype=np.int) # dtype agnostic
full = np.random.choice(idx, size=rows, replace=False)
to_repl, repl_with = np.split(full, 2)
df.values[to_repl] = df.values[repl_with]
Steps:
Get target rows as an integer.
Get a NumPy range-array the same length as your index. Might provide more stability than using the index itself if you have something like an uneven datetime index. (I'm not totally sure, something to toy around with.)
Sample from this index without replacement, sample size is 2 times the number of rows you want to manipulate.
Split the result in half to get targets and replacements. Should be faster than two calls to choice().
Replace at positions to_repl with values from repl_with.
I have two text files A and B, with 16 and 14 columns respectively.
The columns in these files are separated with spaces.
For each entry in column 9 of file A, I want to check if the entry is in column 8 of file B.
If it is, I would like to add this value to a new file (file C). However, I would like file C to retain the same format as file A.
In other words, this new file should contain 17 columns as well.
I have been unable to figure out how to approach this problem and cannot include my progress as a result. Any help is appreciated.
Thank you in advance.
You can read both files into a list, extract B's 8th column in a list and then iterate over file A and check if its 9th element matches with the list of column 8 of B.
If there is a match then I am appending the match at end of each line of A else just print line A.
NOTE: if you do not need the line when there is no match then you can delete the else part.
Code
alines = [line.rstrip('\n') for line in open('aa.txt')]
blines = [line.rstrip('\n') for line in open('bb.txt')]
column8b=[]
for line in blines:
column8b.append(line.split(" ")[7])
with open('cc.txt', "w") as oFile:
for line in alines:
element = line.split(" ")[8]
if element in column8b:
oFile.write(line + " " + element + "\n")
## Delete this if you do not want to write A into C
## when there is no match between A[9] and B[8]
else:
oFile.write(line + "\n")
Sample Data:
aa.txt
1 2 3 4 5 6 7 8 16 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 26 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 36 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 46 10 11 12 13 14 15 16
bb.txt
1 2 3 4 5 6 7 16 9 10 11 12 13 14
1 2 3 4 5 6 7 36 9 10 11 12 13 14
1 2 3 4 5 6 7 8 9 10 11 12 13 14
cc.txt
1 2 3 4 5 6 7 8 16 10 11 12 13 14 15 16 16
1 2 3 4 5 6 7 8 26 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 36 10 11 12 13 14 15 16 36
1 2 3 4 5 6 7 8 46 10 11 12 13 14 15 16
If you read in the file line by line, then you can pull out the relevant information you want.
your_file_A = open("FILEPATH.EXTENSION")
your_file_B = open("FILEPATH.EXTENSION")
your_file_C = open("FILEPATH.EXTENSION", 'w')
col8_of_B=[]
for line in your_file_B:
col8_of_B.append(line[7]) #line[7] is position 8
for line in your_file_A:
if line[8] in col8_of_B:
your_file_C.write(line)
What about awk (since you have the bash tag)?:
awk 'FNR==NR {b[$8]=$0;next} b[$9] {print $0,$9 }' b a > c