Reading csv file as dictionary using pandas - python

I have the following CSV file, with the first row as the header:
A B
test 23
try 34
I want to read this in as a dictionary, so I am doing this:
dt = pandas.read_csv('file.csv').to_dict()
However, this uses the header row as the keys. I want the values in column 'A' to be the keys. How do I do that, i.e. get a result like this:
{'test':'23', 'try':'34'}

dt = pandas.read_csv('file.csv', index_col=0)['B'].to_dict()

Reproducing the data:
import pandas as pd
from io import StringIO
data="""
A B
test 23
try 34
"""
df = pd.read_csv(StringIO(data), delimiter=r'\s+')
Converting to a dictionary:
print(dict(df.values))
Will give:
{'try': 34, 'test': 23}
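The goal stated in the question (column A as keys, column B as values) can also be sketched with set_index. This is a minimal sketch, assuming the file is whitespace-separated as displayed above; the in-memory string is a hypothetical stand-in for file.csv.

```python
from io import StringIO

import pandas as pd

# Hypothetical in-memory stand-in for file.csv (assumed whitespace-separated).
data = """A B
test 23
try 34
"""

df = pd.read_csv(StringIO(data), sep=r"\s+")

# Use column A as the index, then turn column B into a dict keyed by that index.
result = df.set_index("A")["B"].to_dict()
print(result)

# The question shows string values; dtype=str keeps every field as text.
df_str = pd.read_csv(StringIO(data), sep=r"\s+", dtype=str)
result_str = dict(zip(df_str["A"], df_str["B"]))
print(result_str)
```

The first variant keeps the numeric type that pandas infers for column B; the second matches the quoted values shown in the question.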

Related

Create DataFrame from String "like" data

I have text data that looks like this:
3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"
I want to transform it into a table like this:
a b e r
1 2 5 7
23 45 76 76
I've tried to use a pandas data frame for that, but the data is quite big, about 40 MB.
So what should I do to solve it?
Sorry for my bad explanation. I hope you can understand what I mean. Thanks!
import os
import pandas as pd
from io import StringIO
a = pd.read_csv(StringIO("12test.txt"), sep=",", header=None, error_bad_lines=False)
df = pd.DataFrame([row.split('.') for row in a.split('\n')])
print(df)
I've tried this, but it doesn't work.
Several problems occurred: the error "'DataFrame' object has no attribute 'split'", a data frame that contains the string "12test.txt" instead of the data inside the file, memory problems, etc.
Try:
>>> s = '3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"'
>>> pd.DataFrame([[x.strip('"') for x in i.split(',')[1:]] for i in s.splitlines()[1:]], columns=[x.strip('"') for x in s.splitlines()[0].split(',')[1:]])
    a   b   e   r
0   1   2   5   7
1  23  45  76  76
>>>
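This uses a list comprehension to split the rows and strip the quotes, then converts the result to a pandas.DataFrame.
To read text data held in a string, wrap it in StringIO. Removing the leading digit that follows each \n makes the string readable by read_csv; the leading digit of the first line then becomes an extra column that can be dropped.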
import io
import re
import pandas as pd
s = '3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"'
s = re.sub(r'[\n][0-9]', "\n", s)
df = pd.read_csv(io.StringIO(s))
# remove the first column, which is generated by the leading digit and contains only NaN values
df = df.drop(df.columns[0], axis=1)
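As an alternative to editing the string first, read_csv can parse the sample directly and the leading count column can be dropped afterwards. This is a minimal sketch on the sample string from the question; dtype=str is a choice made here to keep every field as text. For a file on disk, passing the path itself to read_csv (rather than wrapping the file name in StringIO) is what makes pandas read the file contents.

```python
import io

import pandas as pd

s = '3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"'

# The first line becomes the header; dtype=str keeps every field as text.
df = pd.read_csv(io.StringIO(s), dtype=str)

# Drop the leading count column (named "3" after the first header field).
df = df.iloc[:, 1:]
print(df)
```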

Converting an xlsx file to a dictionary in Python pandas

I am trying to import a dataframe from an xlsx file into Python and then convert this dataframe to a dictionary. This is what my Excel file looks like:
A B
1 a b
2 c d
where A and B are names of columns and 1 and 2 are names of rows.
I want to convert the data frame to a dictionary in python, using pandas. My code is pretty simple:
import pandas as pd
my_dict = pd.read_excel(r'.\inflation.xlsx', sheet_name='Sheet2', index_col=0).to_dict()
print(my_dict)
What I want to get is:
{'a': 'b', 'c': 'd'}
But what I get is:
{'b': {'c': 'd'}}
What might be the issue?
This does what is requested:
import pandas as pd
d = pd.read_excel(r'.\inflation.xlsx', sheet_name='Sheet2', index_col=0, header=None).transpose().to_dict('records')[0]
print(d)
Output:
{'a': 'b', 'c': 'd'}
The to_dict() function takes an orient parameter, which specifies the shape of the resulting dictionary. There are other options if you have more rows.
This should work
import pandas as pd
my_dict = pd.read_excel(r'.\inflation.xlsx', sheet_name='Sheet2', header=0, index_col=None).to_dict('records')
print(my_dict)
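The effect of the orient parameter can be illustrated without an Excel file. This is a minimal sketch, assuming the sheet holds just the two rows a, b and c, d with no header; the DataFrame below is a hypothetical stand-in for the result of read_excel(..., header=None).

```python
import pandas as pd

# Hypothetical stand-in for the sheet: two columns, no header row.
df = pd.DataFrame([["a", "b"], ["c", "d"]])

# First column as keys, second column as values.
pairs = df.set_index(0)[1].to_dict()
print(pairs)

# orient="records" returns one dict per row instead.
records = df.to_dict("records")
print(records)

# The default orient ("dict") nests the data as {column: {index: value}}.
nested = df.to_dict()
print(nested)
```

The nested default explains the shape of the unexpected result in the question: one outer key per column, with the index labels as inner keys.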

I want to put the output of this print(d) function into a pandas dataframe

I am working with the bioservices package in python and I want to take the output of this function and put it into a dataframe using pandas
from bioservices import UniProt
u = UniProt(verbose=False)
d = u.search("yourlist:M20211203F248CABF64506F29A91F8037F07B67D133A278O", frmt="tab", limit=5,
columns="id, entry name")
print(d)
This is the result I am getting, almost like a neat little table.
The problem, however, is that I cannot work with the data in this form, and I want to put it into a dataframe using pandas.
The code below does not work; it only returns the error "ValueError: DataFrame constructor not properly called":
import pandas as pd
df = pd.DataFrame(columns= ['Entry','Entry name'],
data=d)
print(df)
Use pd.read_csv, after encapsulating your output in a StringIO (to present a file-like interface):
import io
import pandas as pd
data = 'Entry\tEntry name\na\t1\nb\t2'
io_data = io.StringIO(data)
df = pd.read_csv(io_data, sep='\t')
print(df)
The output is a dataframe:
  Entry  Entry name
0     a           1
1     b           2
Sample data:
from bioservices import UniProt
import io
import pandas as pd
u = UniProt(verbose=False)
d = u.search("yourlist:M20211203F248CABF64506F29A91F8037F07B67D133A278O", frmt="tab", limit=5,
columns="id, entry name")
#print(d)
df = pd.read_csv(io.StringIO(d), sep='\t')
print(df)
    Entry   Entry name
0  Q8TAS1  UHMK1_HUMAN
1  P35916  VGFR3_HUMAN
2  Q96SB4  SRPK1_HUMAN
3  Q6P3W7  SCYL2_HUMAN
4  Q9UKI8   TLK1_HUMAN

Count the number of males and females in a csv file

Suppose I have this csv file named sample.csv:
CODE AGE SEX CITY
---- --- --- ----
E101 25 M New York
E102 42 F New York
E103 31 M Chicago
E104 67 F Chicago
I wish to count the number of males and females in the data. For instance, for this one, the answer would be:
M : 2
F : 2
Where should I start and how should I code it?
You can do this:
import pandas as pd
df = pd.read_csv("sample.csv")
print(f"M : {len(df[df['SEX'] == 'M'])}")
print(f"F : {len(df[df['SEX'] == 'F'])}")
>>> import csv
>>> M, F = 0, 0
>>> with open('file.csv') as csvfile:
...     data = csv.reader(csvfile)
...     next(data)  # skip the header row
...     for row in data:
...         if row[2] == "M":
...             M += 1
...         elif row[2] == "F":
...             F += 1
Import the CSV file, then select the 'SEX' column and count the matches:
import pandas as pd
data = pd.read_csv('sample.csv')
num_males = sum(data['SEX'] == 'M')
num_females = len(data['SEX']) - num_males
Another solution is to use the pandas package:
import pandas as pd
csv_path_file = '' # your csv path file
separator = ';'
df = pd.read_csv(csv_path_file, sep = separator)
df['SEX'].value_counts()
This returns a pd.Series object with 'M' and 'F' as the index and the counts as values.
It is also a good way to check for bad data: you will notice immediately if there is an unexpected category or missing data.
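The data-checking use of value_counts can be sketched as follows. The series below is hypothetical data with one lower-case entry and one missing value added for illustration; dropna=False makes the missing value show up in the counts.

```python
import pandas as pd

# Hypothetical SEX column with one inconsistent entry ("m") and one missing value.
sex = pd.Series(["M", "F", "M", "F", "m", None], name="SEX")

# dropna=False also counts missing values, so unexpected categories stand out.
counts = sex.value_counts(dropna=False)
print(counts)

# After normalising the case, missing values are dropped by default.
clean = sex.str.upper().value_counts()
print(clean)
```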
The simplest way is using Pandas to read data from csv and group by:
import pandas as pd
df = pd.read_csv('sample.csv')  # read data from csv
result = df.groupby('SEX').size()  # use .size() to get the row counts
Output:
SEX
F    2
M    2
dtype: int64
After you read the file using either pandas or the built-in csv module, you can use Counter from the built-in collections module to count occurrences. Consider this example:
import collections
import pandas as pd
df = pd.DataFrame({'CODE':['E101','E102','E103','E104'],'SEX':['M','F','M','F']})
for key, value in collections.Counter(df['SEX']).items():
    print(key, ":", value)
Output:
M : 2
F : 2
Note that I hardcoded the data for simplicity. Explanation: collections.Counter is a dict-like object that accepts an iterable on creation and counts the occurrences of each item in it.

pandas create data frame, floats are objects, how to convert?

I have a text file:
sample value1 value2
A 0.1212 0.2354
B 0.23493 1.3442
I import it:
import numpy as np
import pandas as pd
with open('file.txt', 'r') as fo:
    notes = next(fo)
    headers, *raw_data = [row.strip('\r\n').split('\t') for row in fo]  # get column headers and data
names = [row[0] for row in raw_data]  # extract first column (sample names)
data = np.array([row[1:] for row in raw_data], dtype=float)  # drop the first column
If I then convert it:
s = pd.DataFrame(data,index=names,columns=headers[1:])
the data is recognized as floats. I could get the sample names back as a column with s = s.reset_index().
If I do
s = pd.DataFrame(raw_data,columns=headers)
the floats are objects and I cannot perform standard calculations.
How would you build the data frame? Is it better to import the data as a dict?
BTW, I am using Python 3.3.
You can parse your data file directly into data frame as follows:
df = pd.read_csv('file.txt', sep='\t', index_col='sample')
Which will give you:
         value1  value2
sample
A       0.12120  0.2354
B       0.23493  1.3442
[2 rows x 2 columns]
Then, you can do your computations.
To parse such a file, use the pandas read_csv function.
Below is a minimal example that uses read_csv with sep=r'\s+' to split on whitespace (older pandas versions offered delim_whitespace=True for the same purpose; it is deprecated since pandas 2.2):
import pandas as pd
from io import StringIO  # Python 3 (on Python 2: from StringIO import StringIO)
data = \
"""sample value1 value2
A 0.1212 0.2354
B 0.23493 1.3442"""
# Creation of the dataframe
df = pd.read_csv(StringIO(data), sep=r'\s+')
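If the data frame has already been built from lists of strings, as in the question, the object columns can be converted afterwards. This is a minimal sketch; headers and raw_data below are hypothetical stand-ins for the values parsed from file.txt.

```python
import pandas as pd

# Hypothetical stand-ins for the headers and rows parsed from file.txt.
headers = ["sample", "value1", "value2"]
raw_data = [["A", "0.1212", "0.2354"], ["B", "0.23493", "1.3442"]]

# Built from strings, so every column starts out with dtype object.
s = pd.DataFrame(raw_data, columns=headers)

# Convert only the value columns to float; the sample names stay as text.
value_cols = headers[1:]
s[value_cols] = s[value_cols].astype(float)
print(s.dtypes)

# Standard calculations now work on the converted columns.
total = s["value1"] + s["value2"]
print(total)
```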
