Suppose I have this csv file named sample.csv:
CODE AGE SEX CITY
---- --- --- ----
E101 25 M New York
E102 42 F New York
E103 31 M Chicago
E104 67 F Chicago
I wish to count the number of males and females in the data. For instance, for this one, the answer would be:
M : 2
F : 2
Where should I start and how should I code it?
You can do this:
import pandas as pd
df = pd.read_csv("sample.csv")
print(f"M : {len(df[df['SEX'] == 'M'])}")
print(f"F : {len(df[df['SEX'] == 'F'])}")
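>>> import csv
>>> M, F = 0, 0
>>> with open('file.csv') as csvfile:
...     data = csv.reader(csvfile)
...     for row in data:
...         # a conditional expression cannot hold two assignments, so use if/elif;
...         # the header row matches neither branch and is skipped
...         if row[2] == "M":
...             M += 1
...         elif row[2] == "F":
...             F += 1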
Import the CSV file, then select the 'SEX' column and count the matches:
import pandas as pd
data = pd.read_csv('sample.csv')
num_males = sum(data['SEX'] == 'M')
num_females = len(data['SEX']) - num_males
Another solution is to use the pandas package:
import pandas as pd
csv_path_file = '' # your csv path file
separator = ';'
df = pd.read_csv(csv_path_file, sep = separator)
df['SEX'].value_counts()
This returns a pd.Series with 'M' and 'F' as the index and the counts as the values.
It is also a handy way to check for bad data: you will immediately notice an unexpected category, and missing values show up too if you pass dropna=False.
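As a rough sketch of that sanity check, the snippet below builds a small frame in memory instead of reading sample.csv. The extra 'X' entry and the missing value are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical data: the last two entries stand in for dirty input.
df = pd.DataFrame({"SEX": ["M", "F", "M", "F", "X", None]})

# dropna=False keeps missing values as their own category in the result.
counts = df["SEX"].value_counts(dropna=False)
print(counts)
```

Any index label other than 'M' or 'F' in the printed Series points at a row worth inspecting.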
The simplest way is using Pandas to read data from csv and group by:
import pandas as pd
df = pd.read_csv('sample.csv')  # read data from csv
result = df.groupby('SEX').size()  # use .size() to get the row counts
Output:
SEX
F    2
M    2
dtype: int64
After you read the file, using either the external pandas package or the built-in csv module, you might use Counter from the built-in collections module to count occurrences. Consider this example:
import collections
import pandas as pd
df = pd.DataFrame({'CODE':['E101','E102','E103','E104'],'SEX':['M','F','M','F']})
for key, value in collections.Counter(df['SEX']).items():
    print(key,":",value)
Output:
M : 2
F : 2
Note that I hardcoded the data for simplicity. Explanation: collections.Counter is a dict-like object that accepts an iterable on creation and counts the occurrences of each item in that iterable.
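The same idea also works without pandas. The sketch below pairs Counter with the built-in csv module. It assumes a comma-delimited layout with a header row, and it uses an in-memory string (csv_text, a made-up name) in place of the real file.

```python
import collections
import csv
import io

# Stand-in for open('sample.csv'); a comma-delimited layout is an assumption.
csv_text = (
    "CODE,AGE,SEX,CITY\n"
    "E101,25,M,New York\n"
    "E102,42,F,New York\n"
    "E103,31,M,Chicago\n"
    "E104,67,F,Chicago\n"
)

with io.StringIO(csv_text) as csvfile:
    # DictReader uses the header row for field names, so columns are read by name.
    reader = csv.DictReader(csvfile)
    counts = collections.Counter(row["SEX"] for row in reader)

for key, value in counts.items():
    print(key, ":", value)
```

If the real file is whitespace-separated rather than comma-separated, the delimiter would need to change accordingly.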
Related
I have text data that looks like this:
3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"
I want to transform it into a table like this:
a b e r
1 2 5 7
23 45 76 76
I've tried to use a pandas data frame for that, but the data is quite big, about 40 MB.
What should I do to solve this?
Sorry for my bad explanation. I hope you can understand what I mean. Thanks!
import os
import pandas as pd
from io import StringIO
a = pd.read_csv(StringIO("12test.txt"), sep=",", header=None, error_bad_lines=False)
df = pd.DataFrame([row.split('.') for row in a.split('\n')])
print(df)
I've tried this but it doesn't work.
Several errors occurred, such as "'DataFrame' object has no attribute 'split'", the data frame containing the string "12test.txt" instead of the data inside the file, memory problems, etc.
Try:
>>> s = '3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"'
>>> pd.DataFrame([[x.strip('"') for x in i.split(',')[1:]] for i in s.splitlines()[1:]], columns=[x.strip('"') for x in s.splitlines()[0].split(',')[1:]])
a b e r
0 1 2 5 7
1 23 45 76 76
>>>
Use a list comprehension then convert it to a pandas.DataFrame.
To read text data with read_csv you can wrap it in StringIO. Removing the digit that follows each \n gives an input string that read_csv can parse; the leftover first column, which then holds only NaN values, is dropped afterwards.
import io
import re
import pandas as pd
s = '3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"'
s = re.sub(r'[\n][0-9]', "\n", s)
df = pd.read_csv(io.StringIO(s))
# remove the column generated by the leading field, which contains only NaN values
# (drop returns a new frame, so assign the result back)
df = df.drop(df.columns[0], axis=1)
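An alternative sketch that avoids the regular expression: parse each record with the built-in csv module and discard its first field. It assumes the \n in the sample are real line breaks and that the leading number on each line is not part of the table. io.StringIO stands in for the opened file here.

```python
import csv
import io

s = '3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"'

# io.StringIO(s) stands in for open("12test.txt", newline="").
with io.StringIO(s) as f:
    reader = csv.reader(f)
    # Drop the leading field of every record; the first record is the header.
    header, *rows = (record[1:] for record in reader)

print(header)
for row in rows:
    print(row)
```

The resulting lists can be handed to pd.DataFrame(rows, columns=header) if a data frame is still wanted.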
I am working with the bioservices package in Python and I want to take the output of this function and put it into a dataframe using pandas:
from bioservices import UniProt
u = UniProt(verbose=False)
d = u.search("yourlist:M20211203F248CABF64506F29A91F8037F07B67D133A278O", frmt="tab", limit=5,
columns="id, entry name")
print(d)
This is the result I am getting, almost like a neat little table.
The problem, however, is that I cannot work with the data in this form, and I want to put it into a dataframe using pandas.
Trying the code below does not work; it only returns the error "ValueError: DataFrame constructor not properly called":
import pandas as pd
df = pd.DataFrame(columns= ['Entry','Entry name'],
data=d)
print(df)
Use pd.read_csv, after encapsulating your output in a StringIO (to present a file-like interface):
import io
import pandas as pd
data = 'Entry\tEntry name\na\t1\nb\t2'
io_data = io.StringIO(data)
df = pd.read_csv(io_data, sep='\t')
print(df)
The output is a dataframe:
Entry Entry name
0 a 1
1 b 2
With your sample data:
from bioservices import UniProt
import io
import pandas as pd
u = UniProt(verbose=False)
d = u.search("yourlist:M20211203F248CABF64506F29A91F8037F07B67D133A278O", frmt="tab", limit=5,
columns="id, entry name")
#print(d)
df = pd.read_csv(io.StringIO(d), sep='\t')
print(df)
Entry Entry name
0 Q8TAS1 UHMK1_HUMAN
1 P35916 VGFR3_HUMAN
2 Q96SB4 SRPK1_HUMAN
3 Q6P3W7 SCYL2_HUMAN
4 Q9UKI8 TLK1_HUMAN
I have a tsv file containing an array which has been read using read_csv().
The dtype of the array is shown as dtype: object. How do I read it and access it as an array?
For example:
df=
id values
1 [0,1,0,3,5]
2 [0,0,2,3,4]
3 [1,1,0,2,3]
4 [2,4,0,3,5]
5 [3,5,0,3,5]
Currently I am unpacking it as below:
for index,row in df.iterrows():
    string = row['col2']
    string=string.replace('[',"")
    string=string.replace(']',"")
    v1,v2,v3,v4,v5=string.split(",")
    v1=int(v1)
    v2=int(v2)
    v3=int(v3)
    v4=int(v4)
    v5=int(v5)
Is there any alternative to this?
I want to do this because I want to create another column in the dataframe taking the average of all the values.
Adding additional details:
My tsv file looks as below:
id values
1 [0,1,0,3,5]
2 [0,0,2,3,4]
3 [1,1,0,2,3]
4 [2,4,0,3,5]
5 [3,5,0,3,5]
I am reading the tsv file as follows:
df=pd.read_csv('tsv_file_name.tsv',sep='\t', header=0)
You can use json to simplify your parsing:
import json
import numpy as np
df['col2'] = df.col2.apply(json.loads)
edit: following your comment, getting the average is easy:
# using numpy
df['col2_mean'] = df.col2.apply(lambda t: np.array(t).mean())
# by hand
df['col2_mean'] = df.col2.apply(lambda t: sum(t)/len(t))
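Put together as a runnable sketch, assuming the column is named values as in the question's file listing rather than col2. The TSV content is supplied through an in-memory string (tsv_text, a made-up name) instead of the real file.

```python
import io
import json

import pandas as pd

# Stand-in for the TSV file from the question.
tsv_text = (
    "id\tvalues\n"
    "1\t[0,1,0,3,5]\n"
    "2\t[0,0,2,3,4]\n"
    "3\t[1,1,0,2,3]\n"
    "4\t[2,4,0,3,5]\n"
    "5\t[3,5,0,3,5]\n"
)

df = pd.read_csv(io.StringIO(tsv_text), sep="\t", header=0)

# Use df["values"], not df.values: the attribute form returns the frame's array.
df["values"] = df["values"].apply(json.loads)        # "[0,1,0,3,5]" -> [0, 1, 0, 3, 5]
df["values_mean"] = df["values"].apply(lambda t: sum(t) / len(t))
print(df)
```

json.loads works here because the bracketed lists happen to be valid JSON; other list notations would need a different parser.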
import csv
with open('myfile.tsv') as tsvfile:
    line = csv.reader(tsvfile, delimiter='...')
    ...
OR
import pandas as pd
# DataFrame.from_csv was removed in pandas 1.0; read_csv is its replacement
df = pd.read_csv("myfile.tsv", sep="...")
I have the following csv, with the first row as the header:
A B
test 23
try 34
I want to read in this as a dictionary, so doing this:
dt = pandas.read_csv('file.csv').to_dict()
However, this reads in the header row as key. I want the values in column 'A' to be the keys. How do I do that i.e. get answer like this:
{'test':'23', 'try':'34'}
dt = pandas.read_csv('file.csv', index_col=0)['B'].to_dict()
Reproducing the data:
import pandas as pd
from io import StringIO
data="""
A B
test 23
try 34
"""
df = pd.read_csv(StringIO(data), delimiter=r'\s+')
Converting to a dictionary:
print(dict(df.values))
Will give:
{'try': 34, 'test': 23}
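If the values should stay strings, as in the output the question asks for, one possible variation is to read every column as text and zip the two columns together. This is a sketch that assumes whitespace-separated input like the snippet above.

```python
import io

import pandas as pd

data = "A B\ntest 23\ntry 34\n"

# dtype=str keeps '23' and '34' as strings instead of converting them to integers.
df = pd.read_csv(io.StringIO(data), sep=r"\s+", dtype=str)

# Pair each value in column A with the value in column B on the same row.
dt = dict(zip(df["A"], df["B"]))
print(dt)
```

Duplicate keys in column A would silently overwrite earlier rows, so this only suits data where A is unique.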
I have a text file:
sample value1 value2
A 0.1212 0.2354
B 0.23493 1.3442
I import it:
with open('file.txt', 'r') as fo:
    notes = next(fo)
    headers,*raw_data = [row.strip('\r\n').split('\t') for row in fo] # get column headers and data
names = [row[0] for row in raw_data] # extract first column (sample names)
data= np.array([row[1:] for row in raw_data],dtype=float) # drop the first column
If I then convert it:
s = pd.DataFrame(data,index=names,columns=headers[1:])
the data is recognized as floats. I could get the sample names back as a column with s=s.reset_index().
If I do
s = pd.DataFrame(raw_data,columns=headers)
the floats are objects and I cannot perform standard calculations.
How would you make the data frame? Is it better to import the data as a dict?
BTW, I am using Python 3.3.
You can parse your data file directly into data frame as follows:
df = pd.read_csv('file.txt', sep='\t', index_col='sample')
Which will give you:
value1 value2
sample
A 0.12120 0.2354
B 0.23493 1.3442
[2 rows x 2 columns]
Then, you can do your computations.
To parse such a file, use the pandas read_csv function.
Below is a minimal example showing read_csv with the parameter delim_whitespace set to True:
import pandas as pd
try:
    from io import StringIO  # Python 3
except ImportError:
    from StringIO import StringIO  # Python 2
data = \
"""sample value1 value2
A 0.1212 0.2354
B 0.23493 1.3442"""
# Creation of the dataframe
df = pd.read_csv(StringIO(data), delim_whitespace=True)
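A variation on the same idea, offered as a sketch rather than a drop-in replacement: to my knowledge recent pandas releases deprecate delim_whitespace in favor of sep=r'\s+', so the version below uses the latter and also sets the sample column as the index, which leaves only numeric columns for calculations.

```python
import io

import pandas as pd

data = "sample value1 value2\nA 0.1212 0.2354\nB 0.23493 1.3442\n"

# sep=r"\s+" splits on any run of whitespace (tabs or spaces).
df = pd.read_csv(io.StringIO(data), sep=r"\s+", index_col="sample")

print(df.dtypes)        # both value columns are parsed as floats
print(df.mean(axis=1))  # row-wise mean per sample
```

Because read_csv infers the column types itself, no manual conversion through a NumPy array is needed.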