randomly subsampling lines in a file - python

I have a file like this:
Tree 5
Jaguar 9
Cat 23
Monkey 12
Gorilla 67
Is it possible to randomly subsample 3 of these lines?
For example:
Jaguar 9
Gorilla 67
Tree 5
or
Monkey 12
Tree 5
Cat 23
etc.?

Using random.sample on readlines (with a context manager so the file handle is closed):
import random

with open('foo.txt') as f:
    print(random.sample(f.readlines(), 3))
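If the file is too large to hold in memory, here is a minimal reservoir-sampling sketch (my addition, not part of the original answer) that picks k lines in a single pass with constant memory:
import random

def sample_lines(path, k=3):
    # Algorithm R: keep the first k lines, then replace a random
    # reservoir slot with probability k/i for the i-th line.
    reservoir = []
    with open(path) as f:
        for i, line in enumerate(f, 1):
            if i <= k:
                reservoir.append(line)
            else:
                j = random.randrange(i)
                if j < k:
                    reservoir[j] = line
    return reservoir

print(sample_lines('foo.txt'))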


Rounding up pandas column to nearest n unit value

I have a 200,000-row dataframe that looks like this:
df =

index  name  d2b(m)
0      Jon   199.9
1      Amy   29
2      Fyn   19
3      Luc   30
4      And   76
5      Pia   90
I am writing a function to classify the "distance to bus stop (d2b)" column into a new column, one class for every 10 meters, expecting:
index  name  d2b (m)  class (<= x meters)
0      Jon   199.9    200m
1      Amy   29       30m
2      Fyn   19       20m
3      Luc   33       40m
4      And   76       80m
5      Pia   90       90m
Code that works (updated):
numpy.ceil(data["d2b (m)"]/10)*10
This is one way of achieving it. Note that math.ceil only works on scalars, so numpy.ceil is used instead, and the column name must be quoted:
import numpy as np
df['class (<= x meters)'] = np.ceil(df['d2b (m)'] / 10) * 10
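For completeness, a minimal end-to-end sketch (my addition, reconstructing the sample frame from the question) that also appends the "m" suffix shown in the expected output:
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the question's sample data.
df = pd.DataFrame({"name": ["Jon", "Amy", "Fyn", "Luc", "And", "Pia"],
                   "d2b (m)": [199.9, 29, 19, 33, 76, 90]})

# Round each distance up to the next multiple of 10 and format as "<n>m".
df["class (<= x meters)"] = (np.ceil(df["d2b (m)"] / 10) * 10).astype(int).astype(str) + "m"
print(df)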

Transposing the columns and organizing unstructured csv files with pandas

I have too many messy csv files and I am trying to extract information from them. There is a random number of unnecessary rows at the beginning of each file; however, the columns that I am interested in always have the same index. Let me explain through an example:
RandomInfo XX
Random2 ZZ
Random3 VV
Random4 KK
Companyname: Apple
VisitsMay ImpressionsMay VisitsApril ImpressionsApril...
Information
International 100 250 90 260
Local 10 22 12 26
With Proxy 5 12 8 16
I want to convert this to:
Companyname  Month             International  Local  With Proxy
Apple        VisitsMay         100            10     5
Apple        ImpressionsMay    250            22     12
Apple        VisitsApril       90             12     8
Apple        ImpressionsApril  260            26     16
Example files here
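No answer is attached to this question; the following is a rough sketch of one possible approach (all marker strings, the file name, and the tab separator are assumptions about the real files): locate the company line and the table header, read only the table, and transpose it.
import pandas as pd

def tidy_report(path):
    with open(path) as f:
        lines = f.read().splitlines()
    # "Companyname" and "VisitsMay" as anchors are assumptions.
    company = next(l for l in lines if l.startswith("Companyname")).split(":", 1)[1].strip()
    start = next(i for i, l in enumerate(lines) if "VisitsMay" in l)

    # Assumes tab-separated fields with the row labels
    # (International, Local, With Proxy) in the first column.
    df = pd.read_csv(path, sep="\t", skiprows=start)
    df = df.set_index(df.columns[0]).T          # months become rows
    df = df.rename_axis("Month").reset_index()  # month labels into a column
    df.insert(0, "Companyname", company)        # company name first
    return df

print(tidy_report("apple.csv"))  # hypothetical file name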

.csv loading repeats all entries from one column in every cell

I am attempting to load a given csv file with the following structure:
Then, I'd like to join all the words with the same "Sent_ID" into one row, with the following code:
train = pd.read_csv("train.csv")
# Create a dataframe of sentences.
sentence_df = pd.DataFrame(train["Sent_ID"].drop_duplicates(), columns=["Sent_ID", "Sentence", "Target"])
for _, row in train.iterrows():
    print(str(row["Word"]))
    sentence_df.loc[sentence_df["Sent_ID"] == row["Sent_ID"], ["Sentence"]] = str(row["Word"])
However, the result of print(str(row["Word"])) is:
0          Obesity
1               in
2             Low-
3              and
4    Middle-Income
5        Countries
...
Name: Word, Length: 4543833, dtype: object
i.e. every single word in the column, for any given row. This occurs for all rows.
Printing the entire row gives:
id                       89
Doc_ID                    1
Sent_ID                   4
Word       0 Obesity\n1 ...
tag                       O
Name: 88, dtype: object
This again suggests that every element of the "Word" column is present in each cell. (The 88th entry is not "Obesity\n1" in the .csv file.)
I have tried changing the quoting argument in the read_csv function, as well as manually inserting the headers in the names argument, to no avail.
How do I ensure each Dataframe entry only contains its own word?
I've added a pastebin with some of the samples here (the pastebin will expire a week after this edit).
Building on @Aravind's answer, the OP wanted a working example:
from io import StringIO
import pandas as pd

csv = StringIO('''
<paste csv snippet here>
''')
df = pd.read_csv(csv)
# Print first 5 rows
print(df.head())
   id  Doc_ID  Sent_ID           Word tag
0   1       1        1        Obesity   O
1   2       1        1             in   O
2   3       1        1           Low-   O
3   4       1        1            and   O
4   5       1        1  Middle-Income   O
Now we have the data loaded as a pandas.DataFrame. We can use the groupby method to combine the words into sentences.
df = df.groupby('Sent_ID').Word.apply(' '.join).reset_index()
print(df)
Sent_ID Word
0 1 Obesity in Low- and Middle-Income Countries : ...
1 2 We have reviewed the distinctive features of e...
2 3 Obesity is rising in every region of the world...
3 4 In LMICs , overweight is higher in women compa...
4 5 Overweight occurs alongside persistent burdens...
5 6 Changes in the global diet and physical activi...
6 7 Emerging risk factors include environmental co...
7 8 Data on effective strategies to prevent the on...
8 9 Expanding the research in this area is a key p...
9 10 MICROCEPHALIA VERA
10 11 Excellent reproducibility of laser speckle con...
11 12 We compared the inter-day reproducibility of p...
12 13 We also tested whether skin blood flow assessm...
13 14 Skin blood flow was evaluated during PORH and ...
14 15 Data are expressed as cutaneous vascular condu...
15 16 Reproducibility is expressed as within subject...
16 17 Twenty-eight healthy participants were enrolle...
17 18 The reproducibility of the PORH peak CVC was b...
18 19 Inter-day reproducibility of the LTH plateau w...
19 20 Finally , we observed significant correlation ...
20 21 The recently developed LSCI technique showed v...
21 22 Moreover , we showed significant correlation b...
22 23 However , more data are needed to evaluate the...
23 24 Positive inotropic action of cholinesterase on...
24 25 The putative chloride channel hCLCA2 has a sin...
25 26 Calcium-activated chloride channel ( CLCA ) pr...
26 27 Genetic and electrophysiological studies have ...
27 28 The human CLCA2 protein is expressed as a 943-...
28 29 Earlier investigations of transmembrane geomet...
29 30 However , analysis by the more recently derive...
Use groupby():
df = df.groupby('Sent_ID')['Word'].apply(' '.join).reset_index()
You can also group by multiple columns by passing a list, like so:
df.groupby(['Doc_ID', 'Sent_ID', 'tag'])
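A short usage sketch of that multi-key variant (my addition, assuming the same train.csv columns as above):
# Join the words within each (Doc_ID, Sent_ID, tag) group separately.
out = (train.groupby(['Doc_ID', 'Sent_ID', 'tag'])['Word']
            .apply(' '.join)
            .reset_index())
print(out.head())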

Pandas: combine columns without duplicates/ find unique words after combining

I have a dataframe where I would like to concatenate certain columns.
My issue is that the text in these columns may or may not contain duplicate information. I would like to strip out the duplicates in order to retain only the relevant information.
For example, if I had a data frame such as:
pd.read_csv("animal.csv")
animal1 animal2 label
1 cat dog dolphin 19
2 dog cat cat 72
3 pilchard 26 koala 26
4 newt bat 81 bat 81
I want to combine the columns but retain only unique information from each of the strings.
You can see that in row 2, 'cat' is contained in both columns 'Animal1' and 'Animal2'. In row 3, the number 26 is in both column 'Animal1' and 'Label'. Whereas in row 4, the information in columns 'Animal2' and 'Label' is duplicated.
I combine the columns by doing the following
animals["detail"] = animals["animal1"].map(str) + animals["animal2"].map(str) + animals["label"].map(str)
       animal1 animal2      label                detail
1          cat     dog dolphin 19    cat dog dolphin 19
2      dog cat     cat         72        dog cat cat 72
3  pilchard 26   koala         26  pilchard 26 koala 26
4         newt  bat 81     bat 81    newt bat 81 bat 81
Row 1 is fine, but the other rows, of course, contain duplicates as described above.
The output I would desire is:
       animal1 animal2      label              detail
1          cat     dog dolphin 19  cat dog dolphin 19
2      dog cat     cat         72          dog cat 72
3  pilchard 26   koala         26   pilchard koala 26
4         newt  bat 81     bat 81         newt bat 81
or if I could retain only the first unique instance of each word/ number per row in the detail column, this would also be suitable i.e.:
               detail
1  cat dog dolphin 19
2          dog cat 72
3   pilchard koala 26
4         newt bat 81
I've had a look at doing this for a string in python e.g. How can I remove duplicate words in a string with Python?, How to get all the unique words in the data frame?, show distinct column values in pyspark dataframe: python
but can't figure out how to apply this to individual rows within the detail column. I've looked at splitting the text after I've combined the columns, then using apply and lambda, but haven't got this to work yet. Or is there perhaps a way to do it when combining the columns?
I have the solution in R but want to recode in python.
Would greatly appreciate any help or advice. I'm currently using Spyder (Python 3.5).
You can use a custom function: first split by whitespace, then get the unique values with pandas.unique, and finally join back to a string:
animals["detail"] = (animals["animal1"].map(str) + ' ' +
                     animals["animal2"].map(str) + ' ' +
                     animals["label"].map(str))
animals["detail"] = animals["detail"].apply(lambda x: ' '.join(pd.unique(x.split())))
print(animals)
       animal1 animal2      label              detail
1          cat     dog dolphin 19  cat dog dolphin 19
2      dog cat     cat         72          dog cat 72
3  pilchard 26   koala         26   pilchard 26 koala
4         newt  bat 81     bat 81         newt bat 81
It is also possible to join the values inside apply:
animals["detail"] = animals.astype(str).apply(
    lambda x: ' '.join(pd.unique(' '.join(x).split())), axis=1)
print(animals)
       animal1 animal2      label              detail
1          cat     dog dolphin 19  cat dog dolphin 19
2      dog cat     cat         72          dog cat 72
3  pilchard 26   koala         26   pilchard 26 koala
4         newt  bat 81     bat 81         newt bat 81
A solution with set also works, but it changes the order:
animals["detail"] = animals.astype(str).apply(
    lambda x: ' '.join(set(' '.join(x).split())), axis=1)
print(animals)
       animal1 animal2      label              detail
1          cat     dog dolphin 19  cat dolphin 19 dog
2      dog cat     cat         72          cat dog 72
3  pilchard 26   koala         26   26 pilchard koala
4         newt  bat 81     bat 81         bat 81 newt
If you want to keep the order of the appearance of the words, you can first split words in each column, merge them, remove duplicates and finally concat them together to a new column.
df['detail'] = (df.astype(str).T.apply(lambda x: x.str.split())
                  .apply(lambda x: ' '.join(pd.Series(sum(x, [])).drop_duplicates())))
df
Out[46]:
        animal1 animal2       label                detail
0         1 cat     dog  dolphin 19  1 cat dog dolphin 19
1     2 dog cat     cat          72          2 dog cat 72
2  3 pilchard 26  koala          26   3 pilchard 26 koala
3        4 newt  bat 81      bat 81         4 newt bat 81
I'd suggest removing the duplicates at the end of the process by using a Python set.
Here is an example function to do so:
def dedup(value):
    words = set(value.split(' '))
    return ' '.join(words)
That works like this:
val = 'dog cat cat 81'
print(dedup(val))
81 dog cat
In case you want the details ordered, you can use OrderedDict from collections, or pd.unique, instead of set.
Then just apply it (similar to map) on your detail column for the desired result:
animals.detail = animals.detail.apply(dedup)
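A minimal order-preserving variant of dedup (my sketch of the OrderedDict suggestion above):
from collections import OrderedDict

def dedup_ordered(value):
    # OrderedDict.fromkeys keeps the first occurrence of each word;
    # on Python 3.7+ a plain dict.fromkeys behaves the same.
    return ' '.join(OrderedDict.fromkeys(value.split()))

print(dedup_ordered('dog cat cat 81'))  # dog cat 81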

Finding the averages from columns

I'm using this txt file named Gradedata.txt and it looks like this:
Sarah K.,10,9,7,9,10,20,19,19,45,92
John M.,9,9,8,9,8,20,20,18,43,95
David R.,8,7,7,9,6,18,17,17,40,83
Joan A.,9,10,10,10,10,20,19,20,47,99
Nick J.,9,7,10,10,10,20,20,19,46,98
Vicki T.,7,7,8,9,9,17,18,19,44,88
I'm looking for the averages of each column. Each column has its own title (Homework #1, Homework #2, etc., in that order). What I am trying to do should look exactly like this:
Homework #1 8.67
Homework #2 8.17
Homework #3 8.33
Homework #4 9.33
Homework #5 8.83
Quiz #1 19.17
Quiz #2 18.83
Quiz #3 18.67
Midterm #1 44.17
Final #1 92.50
Here is my attempt at accomplishing this task:
with open("GradeData.txt", "rtU") as f:
columns = f.readline().strip().split(" ")
numRows = 0
sums = [0] * len(columns)
for line in f:
if not line.strip():
continue
values = line.split(" ")
for i in xrange(len(values)):
sums[i] += int(values[i])
numRows += 1
for index, summedRowValue in enumerate(sums):
print columns[index], 1.0 * summedRowValue / numRows
I'm getting errors, and I also realize I have to name each assignment's average. I'd appreciate some help here.
numpy can chew this up in one line:
>>> np.loadtxt('Gradedata.txt', delimiter=',', usecols=range(1,11)).mean(axis=0)
array([ 8.66666667, 8.16666667, 8.33333333, 9.33333333,
8.83333333, 19.16666667, 18.83333333, 18.66666667,
44.16666667, 92.5 ])
Just transpose and use statistics.mean to get the average, skipping the first col:
import csv
from itertools import islice
from statistics import mean

with open("in.txt") as f:
    for col in islice(zip(*csv.reader(f)), 1, None):
        print(mean(map(float, col)))
Which will give you:
8.666666666666666
8.166666666666666
8.333333333333334
9.333333333333334
8.833333333333334
19.166666666666668
18.833333333333332
18.666666666666668
44.166666666666664
92.5
If the columns are actually named and you want to pair them:
import csv
from itertools import islice
from statistics import mean

with open("in.txt") as f:
    # get column names from the header row
    cols = next(f).strip().split(",")
    # keys are column names, values are averages
    data = {name: mean(map(float, col))
            for name, col in zip(cols[1:], islice(zip(*csv.reader(f)), 1, None))}
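A hypothetical follow-up, assuming a header row such as "Name,Homework #1,...,Final #1" has been added to the file, printing the pairs in the format the question asks for:
for name, avg in data.items():
    print(f"{name} {avg:.2f}")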
Or using pandas.read_csv:
import pandas as pd
df = pd.read_csv("in.txt",index_col=0,header=None)
print(df)
print(df.mean(axis=0))
           1   2   3   4   5   6   7   8   9  10
0
Sarah K.  10   9   7   9  10  20  19  19  45  92
John M.    9   9   8   9   8  20  20  18  43  95
David R.   8   7   7   9   6  18  17  17  40  83
Joan A.    9  10  10  10  10  20  19  20  47  99
Nick J.    9   7  10  10  10  20  20  19  46  98
Vicki T.   7   7   8   9   9  17  18  19  44  88

1      8.666667
2      8.166667
3      8.333333
4      9.333333
5      8.833333
6     19.166667
7     18.833333
8     18.666667
9     44.166667
10    92.500000
dtype: float64
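To attach the assignment names the asker wanted (my addition; the names and their order are taken from the expected output above):
# Assumes the ten columns appear in the question's order.
df.columns = ([f"Homework #{i}" for i in range(1, 6)]
              + [f"Quiz #{i}" for i in range(1, 4)]
              + ["Midterm #1", "Final #1"])
print(df.mean(axis=0).round(2))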
