Remove numbers and user's stop words from pandas data frame - python

I would like to know how to remove certain values from a dataset, specifically numbers and a list of strings. For example:
Test Num
0 bam 132
1 - 65
2 creation 47
3 MAN 32
4 41 831
... ... ...
460 Luchino 21
461 42 4126 7
462 finger 43
463 washing 1
I would like to have something like
Test Num
0 bam 132
2 creation 47
... ... ...
460 Luchino 21
462 finger 43
463 washing 1
where I have (manually) removed MAN (it should be part of a list of strings, like a stop word), -, and the numbers.
I have tried with isdigit, but it is not working, so I am sure there are errors in my code:
df['Text'].where(~df['Text'].str.isdigit())
and for my stop words:
my_stop=['MAN','-']
df['Text'].apply(lambda lst: [x for x in lst if x in my_stop])

If you want to filter, you could use .loc:
df = df.loc[~df.Text.str.isdigit() & ~df.Text.isin(['MAN']), :]
.where(cond, other) returns a dataframe or series of the same shape as self, but keeps the original values where cond is true and replaces with other where it is false.
Read more in the docs
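A minimal runnable sketch of the .loc approach, assuming the column is called Text and using the my_stop list from the question:
import pandas as pd

# small stand-in for the question's data
df = pd.DataFrame({'Text': ['bam', '-', 'creation', 'MAN', '41'],
                   'Num': [132, 65, 47, 32, 831]})

my_stop = ['MAN', '-']
# keep rows whose Text is neither all digits nor in the stop list
df = df.loc[~df['Text'].str.isdigit() & ~df['Text'].isin(my_stop), :]
print(df)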

Hi, you should try this code:
df[df['Text']!='MAN']

Related

Removing from pandas dataframe all rows having less than 3 characters

I have this dataframe
Word Frequency
0 : 79
1 , 60
2 look 26
3 e 26
4 a 25
... ... ...
95 trump 2
96 election 2
97 step 2
98 day 2
99 university 2
I would like to remove all words having less than 3 characters.
I tried as follows:
df['Word']=df['Word'].str.findall('\w{3,}').str.join(' ')
but it does not remove them from my dataset.
Can you please tell me how to remove them?
My expected output would be:
Word Frequency
2 look 26
... ... ...
95 trump 2
96 election 2
97 step 2
98 day 2
99 university 2
Try with
df = df[df['Word'].str.len()>=3]
Instead of attempting a regular expression, you can use .str.len() to get the length of each string in your column. Then you can simply filter on that length being >= 3.
Should look like:
df.loc[df["Word"].str.len() >= 3]
Please try:
df[df.Word.str.len()>=3]
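A minimal end-to-end sketch of the same filter, using an assumed toy version of the word-frequency table:
import pandas as pd

df = pd.DataFrame({'Word': [':', ',', 'look', 'e', 'a', 'university'],
                   'Frequency': [79, 60, 26, 26, 25, 2]})

# keep only rows whose Word has at least 3 characters
df = df[df['Word'].str.len() >= 3]
print(df)   # only 'look' and 'university' remain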

Replace Not Correctly Removing for Object Variable Type

I'm trying to remove the ":30" portion of values in my First variable. The First variable data type is object.
Here are a few examples of the First variable and the counts (ignore the counts):
11a 211
7p 178
4p 127
2:30p 112
11:30a 108
1p 107
12p 105
9a 100
10p 85
2p 24
10:30a 12
6p 5
9:30a 2
9p 2
12:30a 2
8p 2
I wrote the following code, which runs without any errors; however, when I run the value counts, it still shows times with a ":30". The NewFirst variable data type is int64. Not quite sure what I'm doing wrong here.
bad_chars = ":30"
DF["NewFirst"] = DF.First.replace(bad_chars,'')
DF["NewFirst"].value_counts()
The desired output would have the NewFirst values like:
11a 211
7p 178
4p 127
2p 112
11a 108
1p 107
12p 105
9a 100
10p 85
2p 24
10a 12
6p 5
9a 2
9p 2
12a 2
8p 2
You shouldn't be looping over the characters in bad_chars. That will remove all 3 and 0 characters, so 10p will become 1p, and 3a will become a.
You should just replace the whole bad_chars string, with no loop.
You also need to use the .str accessor.
DF["NewFirst"] = DF["First"].str.replace(bad_chars,'')

pandas group by multiple columns and remove rows based on multiple conditions

I have a dataframe which is as follows:
imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,1,491,182,78,1,1
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,5,451,95,48,2,1
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,455,342,84,93,9,-7
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130
It's a CSV dump. From this I want to group by imagename and brandname. Wherever the values in xdiff and ydiff are less than 10, remove the second line.
For example, from the first two lines I want to delete the second line, similarly from lines 3 and 4 I want to delete line 4.
I could do this quickly in R using dplyr's group_by, lag and lead functions. However, I am not sure how to combine different functions in Python to achieve this. This is what I have tried so far:
df[df.groupby(['imagename','brandname']).xdiff.transform() <= 10]
Not sure what function should I call within transform and how to include ydiff too.
The expected output is as follows:
imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130
You can take the individual groupby frames and apply the conditions through the apply function:
#df.groupby(['imagename','brandname'],group_keys=False).apply(lambda x: x.iloc[range(0,len(x),2)] if x['xdiff'].lt(10).any() else x)
df.groupby(['imagename','brandname'],group_keys=False).apply(lambda x: x.iloc[range(0,len(x),2)] if (x['xdiff'].lt(10).any() and x['ydiff'].lt(10).any()) else x)
Out:
imagename locationName brandname x y w h xdiff ydiff
2 95-20180407-215120-235505-00050.jpg Shirt DHFL 3 450 94 45 2 -41
5 95-20180407-215120-235505-00050.jpg Shirt DHFL 446 349 99 90 279 30
7 95-20180407-215120-235505-00050.jpg Shirt GOIBIBO 559 212 70 106 104 -130
0 95-20180407-215120-235505-00050.jpg Shirt SAMSUNG 0 490 177 82 0 0
4 95-20180407-215120-235505-00050.jpg DUGOUT VIVO 167 319 36 38 162 -132
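An alternative sketch of the same filtering without apply: it keeps the first row of each imagename/brandname group and drops any later row whose xdiff and ydiff are both small. It assumes the comparison is on absolute values, which reproduces the expected output shown in the question:
import io
import pandas as pd

csv_text = """imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,1,491,182,78,1,1
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,5,451,95,48,2,1
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,455,342,84,93,9,-7
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130"""
df = pd.read_csv(io.StringIO(csv_text))

# the first detection of each imagename/brandname pair is always kept
first_in_group = ~df.duplicated(subset=['imagename', 'brandname'])
# later detections are dropped when they barely moved in both directions
close = df['xdiff'].abs().lt(10) & df['ydiff'].abs().lt(10)
print(df[first_in_group | ~close])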

Picking data randomly without a repeat in the index and creating a new list out of it

My program needs to pick values randomly without repeating them. After that, the program will assign them random variables.
Assume this is the data:
[input] data
[output]
0
0 770000.000
1 529400.000
2 780000.000
3 731300.000
4 935000.000
5 440000.000
6 634120.000
7 980000.000
8 600000.000
9 770000.000
10 600000.000
11 536613.000
12 660000.000
13 850000.000
14 563600.000
15 985000.000
16 600000.000
17 770000.000
18 957032.000
19 252000.000
20 397000.000
21 218750.000
22 785578.000
As you can see, the data contains repeated numbers in the index 0, 9, and 17. These numbers must not be ignored as the index is different.
I could not find any way to solve my problem. I made many attempts, such as using data.iloc[0], but I received this error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Or, in my other attempts, the data was reduced because the program excluded some similar values.
In my first attempt, I used the following code
Col_list = []
def Grab(repeat):
    for x in range(FixedRange):
        letters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
        Three = [random.choice(letters) +
                 random.choice(letters) +
                 random.choice(letters)]
        A_Slice = random.randint(1, Total_Range_of_Data)
        [Col_list.append(data[A_Slice:A_Slice + 200]),
         Col_list.append(Three*len(data[A_Slice:A_Slice + 200]))]
    Col_list1 = pd.DataFrame(Col_list).T
    Col_listFinal = Col_list1
Grab(0)
and the output will give something like
. . . .
. . . .
190 1.06934e+06 kCn 3.46638e+06 EmV ... 514564 LLl 450000 hfX
191 250000 kCn 1.37e+06 EmV ... 1.00430e+06 LLl 468305 hfX
192 741088 kCn 1.25e+06 EmV ... 312032 LLl 520000 hfX
193 427500 kCn 726700 EmV ... 1.0204e+06 LLl 495750 hfX
194 969600 kCn 853388 EmV ... 139300 LLl 530000 hfX
195 388556 kCn 1.21e+06 EmV ... 437500 LLl 598520 hfX
196 2.045e+06 kCn 1.53636e+06 EmV ... 547835 LLl 538250 hfX
197 435008 kCn 752700 EmV ... 712400 LLl 326000 hfX
198 6.15566e+06 kCn 1.56282e+06 EmV ... 1.385e+06 LLl 480000 hfX
199 551650 kCn 1.222e+06 EmV ... 771512 LLl 495750 hfX
But this is not helpful, as it is random and may take some values more than once. Any suggestions to solve the problem?
By the way, the desired output should be something similar to the one above, but without duplicates.
As #peter-leimbigler said, df.sample gets you most of the way there.
df.sample(10)
data
4 935000.0
13 850000.0
20 397000.0
7 980000.0
22 785578.0
18 957032.0
19 252000.0
10 600000.0
5 440000.0
0 770000.0
This may repeat certain values, if those values exist at multiple index locations, but it shouldn't select the same index location more than once.
If you only want to sample unique values, you can use df[column].unique(), although you can't call .sample on the resulting array directly.
unique_series = df["data"].unique()
df2 = pd.DataFrame(list(unique_series), columns=["data"])
data
0 770000.0
1 529400.0
2 780000.0
3 731300.0
4 935000.0
5 440000.0
6 634120.0
7 980000.0
8 600000.0
9 536613.0
10 660000.0
11 850000.0
12 563600.0
13 985000.0
14 957032.0
15 252000.0
16 397000.0
17 218750.0
18 785578.0
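Because every row of df2 now holds a distinct value, sampling it can no longer return repeated values, for example:
df2.sample(5)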
You can pick random indices without replacement using numpy.random.choice with the replace=False keyword arg. Here's how you would pick n random values from data without repeated indices:
import numpy as np
drand = data.iloc[np.random.choice(np.arange(data.size), n, replace=False)]
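A minimal runnable version of that snippet, assuming data is the single-column frame from the question (shortened here) and n is the sample size:
import numpy as np
import pandas as pd

# shortened stand-in for the question's single-column data
data = pd.DataFrame([770000.0, 529400.0, 780000.0, 731300.0, 935000.0, 440000.0])
n = 3

# choose n distinct row positions, then select those rows by position
positions = np.random.choice(np.arange(len(data)), n, replace=False)
drand = data.iloc[positions]
print(drand)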

Efficiently finding intersecting regions in two huge dictionaries

I wrote a piece of code that finds common IDs in line[1] of two different files. My input file is huge (2 million lines). If I split it into many small files I get more intersecting IDs, while if I run the whole file at once, I get far fewer. I cannot figure out why. Can you suggest what is wrong and how to improve this code to avoid the problem?
fileA = open("file1.txt",'r')
fileB = open("file2.txt",'r')
output = open("result.txt",'w')

dictA = dict()
for line1 in fileA:
    listA = line1.split('\t')
    dictA[listA[1]] = listA

dictB = dict()
for line1 in fileB:
    listB = line1.split('\t')
    dictB[listB[1]] = listB

for key in dictB:
    if key in dictA:
        output.write(dictA[key][0]+'\t'+dictA[key][1]+'\t'+dictB[key][4]+'\t'+dictB[key][5]+'\t'+dictB[key][9]+'\t'+dictB[key][10])
My file1 is sorted by line[0] and has fields 0-15:
contig17 GRMZM2G052619_P03 98 109 2 0 15 67 78.8 0 127 5 420 0 304 45
contig33 AT2G41790.1 98 420 2 0 21 23 78.8 1 127 5 420 2 607 67
contig98 GRMZM5G888620_P01 87 470 1 0 17 28 78.8 1 127 7 420 2 522 18
contig102 GRMZM5G886789_P02 73 115 1 0 34 45 78.8 0 134 5 421 0 456 50
contig123 AT3G57470.1 83 201 2 1 12 43 78.8 0 134 9 420 0 305 50
My file2 is not sorted and has fields 0-10:
GRMZM2G052619 GRMZM2G052619_P03 4 2345 GO:0043531 ADP binding "Interacting selectively and non-covalently with ADP" [GOC:jl] molecular_function PF07525 1
GRMZM5G888620 GRMZM5G888620_P01 1 2367 GO:0011551 DNA binding "Any molecular function by which a gene product interacts selectively and non-covalently with DNA" [GOC:jl] molecular_function PF07589 4
GRMZM5G886789 GRMZM5G886789_P02 1 4567 GO:0055516 ADP binding "Interacting selectively and non-covalently with ADP" [GOC:jl] molecular_function PF07526 0
My desired output,
contig17 GRMZM2G052619_P03 GO:0043531 ADP binding molecular_function PF07525
contig98 GRMZM5G888620_P01 GO:0011551 DNA binding molecular_function PF07589
contig102 GRMZM5G886789_P02 GO:0055516 ADP binding molecular_function PF07526
I really recommend using pandas for this kind of problem.
As proof that this can be done simply with pandas:
import pandas as pd        # install this, and read the docs
from io import StringIO   # you don't need this; it is only used here to simulate reading files

# simulating reading the first file
first_file = """contig17 GRMZM2G052619_P03 x
contig33 AT2G41790.1 x
contig98 GRMZM5G888620_P01 x
contig102 GRMZM5G886789_P02 x
contig123 AT3G57470.1 x"""

# simulating reading the second file
second_file = """y GRMZM2G052619_P03 y
y GRMZM5G888620_P01 y
y GRMZM5G886789_P02 y"""

# Here is how you open the files. Instead of StringIO you will simply pass
# the file path. Give the correct separator, sep="\t" for tabular data
# (here I am using a space). In names, put some relevant names for your columns.
f_df = pd.read_table(StringIO(first_file),
                     header=None,
                     sep=" ",
                     names=['a', 'b', 'c'])
s_df = pd.read_table(StringIO(second_file),
                     header=None,
                     sep=" ",
                     names=['d', 'e', 'f'])

# This is the hard bit, where I am using a bit of my experience with pandas.
# Basically it selects the rows of the second data frame whose value in
# column e "isin" column b of the first data frame.
my_df = s_df[s_df.e.isin(f_df.b)]

Output:
Out[180]:
   d                  e  f
0  y  GRMZM2G052619_P03  y
1  y  GRMZM5G888620_P01  y
2  y  GRMZM5G886789_P02  y

# you can save this with:
my_df.to_csv("result.txt", sep="\t")

Cheers!
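Applied to the actual files from the question, the same idea might look roughly like this; a sketch that assumes tab-separated input with no header rows and keeps the column positions used by the question's output.write call:
import pandas as pd

# read both files; there is no header row, so columns get integer labels
f_df = pd.read_csv("file1.txt", sep="\t", header=None)
s_df = pd.read_csv("file2.txt", sep="\t", header=None)

# keep just the columns the question writes out and give the shared ID a name
a = f_df[[0, 1]].rename(columns={0: "contig", 1: "ID"})
b = s_df[[1, 4, 5, 9, 10]].rename(columns={1: "ID"})

# an inner merge on the ID keeps only the intersecting rows
result = a.merge(b, on="ID")
result.to_csv("result.txt", sep="\t", index=False, header=False)
Unlike the dictionary approach, which keeps only one line per repeated ID, the merge keeps every matching row.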
This is almost the same but within a function.
# Create a function to do the reading for each file
def read_store(file_, dictio_):
    """Given a file name and a dictionary, store each line of the file
    in the dictionary, keyed by the ID found on that line."""
    import re
    with open(file_, 'r') as file_0:
        lines_file_0 = file_0.readlines()
        for line in lines_file_0:
            # I couldn't check it, but this should match whatever comes after a
            # separator character and consists of letters, numbers or underscores
            matches = re.findall(r"^.+\s+(\w+)", line)
            if matches:
                dictio_[matches[0]] = line   # findall returns a list, so take the first match
To use it, do:
file1 = {}
read_store("file1.txt", file1)
And then compare the dictionaries as you already do, but I would use \s instead of \t to split. It will also split between words, but that is easy to rejoin with " ".join(DictA[1:5]).
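A sketch of the comparison step with two dictionaries built this way, keeping the tab split from the question so that multi-word fields such as "ADP binding" stay in one column (the column positions follow the question's output.write call):
file1, file2 = {}, {}
read_store("file1.txt", file1)
read_store("file2.txt", file2)

with open("result.txt", 'w') as output:
    for key in file2:
        if key in file1:
            a = file1[key].split('\t')   # fields of the file1 line
            b = file2[key].split('\t')   # fields of the file2 line
            output.write('\t'.join([a[0], a[1], b[4], b[5], b[9], b[10]]) + '\n')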
