Does pandas support read `set` paramater using read_csv - python

I save set parameter using to_csv.
csv file as below.
1,59,"set([17122, 196, 26405, 13032, 39657, 12427, 25133, 35951,
38928, 2 6088, 10258, 49235, 10326, 13176, 30450, 41787, 14084,
46149])",18,19.0,1 1,5.36363649368
Can I use read_csv and return a set type but str
users = pd.read_csv(DATA_PATH + "users_match.csv", dtype={
})

The answer is yes. Your solution
users = pd.read_csv(DATA_PATH + "users_match.csv", header = None)
will already return column 2 as a string as long as you have double quotes around set([...]).
Then use
users[2].apply(lambda x: eval(x))
to convert it back to set

To convert the DataFrame's str object (the string starting with the characters "set") into a built-in Python set object, here is one way:
>>> import pandas as pd
>>> df = pd.read_csv('users_match.csv', header=None)
>>> type(df[2][0])
str
>>> df.set_value(0, 2, eval(df[2][0]))
>>> type(df[2][0])
set

Related

pandas astype python bool instead of numpy.bool_

I need to convert a pandas dataframe to a JSON object.
However
json.dumps(df.to_dict(orient='records'))
fails as the boolean columns are not JSON serializable since they are of type numpy.bool_. Now I've tried df['boolCol'] = df['boolCol'].astype(bool) but that still leaves the type of the fields as numpy.bool_ rather than the pyhton bool which serializes to JSON no problem.
Any suggestions on how to convert the columns without looping through every record and converting it?
Thanks
EDIT:
This is part of a whole sanitization of dataframes of varying content so they can be used as the JSON payload for an API. Hence we currently have something like this:
for cols in df.columns:
if type(df[cols][0]) == pd._libs.tslibs.timestamps.Timestamp:
df[cols] = df[cols].astype(str)
elif type(df[cols]) == numpy.bool_:
df[cols] = df[cols].astype(bool) #still numnpy bool afterwards!
Just tested it out, and the problem seems to be caused by the orient='records' parameter. Seems you have to set it to a option (e.g. list) and convert the results to your preferred format.
import numpy as np
import pandas as pd
column_name = 'bool_col'
bool_df = pd.DataFrame(np.array([True, False, True]), columns=[column_name])
list_repres = bool_df.to_dict('list')
record_repres = [{column_name: values} for values in list_repres[column_name]]
json.dumps(record_repres)
You need to use .astype and set its field dtype to object
See example below:
df = pd.DataFrame({
"time": ['0hr', '128hr', '72hr', '48hr', '96hr'],
"value": [10, 20, 30, 40, None]
})
df['revoked'] = False
df.revoked = df.revoked.astype(bool)
print 'setting astype as bool:', type(df.iloc[0]['revoked'])
df.revoked = df.revoked.astype(object)
print 'setting astype as object:', type(df.iloc[0]['revoked'])
>>> setting astype as bool: <type 'numpy.bool_'>
>>> setting astype as object: <type 'bool'>

ValueError: DataFrame constructor not properly called

I am trying to create a dataframe with Python, which works fine with the following command:
df_test2 = DataFrame(index = idx, data=(["-54350","2016-06-25T10:29:57.340Z","2016-06-25T10:29:57.340Z"]))
but, when I try to get the data from a variable instead of hard-coding it into the data argument; eg. :
r6 = ["-54350", "2016-06-25T10:29:57.340Z", "2016-06-25T10:29:57.340Z"]
df_test2 = DataFrame(index = idx, data=(r6))
I expect this is the same and it should work? But I get:
ValueError: DataFrame constructor not properly called!
Reason for the error:
It seems a string representation isn't satisfying enough for the DataFrame constructor
Fix/Solutions:
import ast
# convert the string representation to a dict
dict = ast.literal_eval(r6)
# and use it as the input
df_test2 = DataFrame(index = idx, data=(dict))
which will solve the error.

How do I convert csv string to list in pandas?

I'm working with a csv file that has the following format:
"Id","Sequence"
3,"1,3,13,87,1053,28576,2141733,508147108,402135275365,1073376057490373,9700385489355970183,298434346895322960005291,31479360095907908092817694945,11474377948948020660089085281068730"
7,"1,2,1,5,5,1,11,16,7,1,23,44,30,9,1,47,112,104,48,11,1,95,272,320,200,70,13,1,191,640,912,720,340,96,15,1,383,1472,2464,2352,1400,532,126,17,1,767,3328,6400,7168,5152,2464,784,160,19,1,1535,7424"
8,"1,2,4,5,8,10,16,20,32,40,64,80,128,160,256,320,512,640,1024,1280,2048,2560,4096,5120,8192,10240,16384,20480,32768,40960,65536,81920,131072,163840,262144,327680,524288,655360,1048576,1310720,2097152"
11,"1,8,25,83,274,2275,132224,1060067,3312425,10997342,36304451,301432950,17519415551,140456757358,438889687625,1457125820233,4810267148324,39939263006825,2321287521544174,18610239435360217"
I'd like to read this into a data frame with the type of df['Id'] to be integer-like and the type of df['Sequence'] to be list-like.
I currently have the following kludgy code:
def clean(seq_string):
return list(map(int, seq_string.split(',')))
# Read data
training_data_file = "data/train.csv"
train = pd.read_csv(training_data_file)
train['Sequence'] = list(map(clean, train['Sequence'].values))
This appears to work, but I feel like the same could be achieved natively using pandas and numpy.
Does anyone have a recommendation?
You can specify a converter for the Sequence column:
converters: dict, default None
Dict of functions for converting
values in certain columns. Keys can either be integers or column
labels
train = pd.read_csv(training_data_file, converters={'Sequence': clean})
This also works, except that the Sequence is list of string instead of list of int:
df = pd.read_csv(training_data_file)
df['Sequence'] = df['Sequence'].str.split(',')
To convert each element to int:
df = pd.read_csv(training_data_file)
df['Sequence'] = df['Sequence'].str.split(',').apply(lambda s: list(map(int, s)))
An alternative solution is to use literal_eval from the ast module. literal_eval evaluates the string as input to the Python interpreter and should give you back the list as expected.
def clean(x):
return literal_eval(x)
train = pd.read_csv(training_data_file, converters={'Sequence': clean})

pandas read_json: "If using all scalar values, you must pass an index"

I have some difficulty in importing a JSON file with pandas.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
This is the error that I get:
ValueError: If using all scalar values, you must pass an index
The file structure is simplified like this:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
It is from the machine learning course of University of Washington on Coursera. You can find the file here.
Try
ser = pd.read_json('people_wiki_map_index_to_word.json', typ='series')
That file only contains key value pairs where values are scalars. You can convert it to a dataframe with ser.to_frame('count').
You can also do something like this:
import json
with open('people_wiki_map_index_to_word.json', 'r') as f:
data = json.load(f)
Now data is a dictionary. You can pass it to a dataframe constructor like this:
df = pd.DataFrame({'count': data})
You can do as #ayhan mention which will give you a column base format
Or you can enclose the object in [ ] (source) as shown below to give you a row format that will be convenient if you are loading multiple values and planing on using matrix for your machine learning models.
df = pd.DataFrame([data])
I think what is happening is that the data in
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
is being read as a string instead of a json
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
is actually
'{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}'
Since a string is a scalar, it wants you to load it as a json, you have to convert it to a dict which is exactly what the other response is doing
The best way is to do a json loads on the string to convert it to a dict and load it into pandas
myfile=f.read()
jsonData=json.loads(myfile)
df=pd.DataFrame(data)
{
"biennials": 522004,
"lb915": 116290
}
df = pd.read_json('values.json')
As pd.read_json expects a list
{
"biennials": [522004],
"lb915": [116290]
}
for a particular key, it returns an error saying
If using all scalar values, you must pass an index.
So you can resolve this by specifying 'typ' arg in pd.read_json
map_index_to_word = pd.read_json('Datasets/people_wiki_map_index_to_word.json', typ='dictionary')
For newer pandas, 0.19.0 and later, use the lines parameter, set it to True.
The file is read as a json object per line.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json', lines=True)
If fixed the following errors I encountered especially when some of the json files have only one value:
ValueError: If using all scalar values, you must pass an index
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
ValueError: Trailing data
For example
cat values.json
{
name: "Snow",
age: "31"
}
df = pd.read_json('values.json')
Chances are you might end up with this
Error: if using all scalar values, you must pass an index
Pandas looks up for a list or dictionary in the value. Something like
cat values.json
{
name: ["Snow"],
age: ["31"]
}
So try doing this. Later on to convert to html tohtml()
df = pd.DataFrame([pd.read_json(report_file, typ='series')])
result = df.to_html()
I solved this by converting it into an array like so
[{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}]

Python Assign Value to New Column If Contains() Is True

How can I use the str.contains() method to check a column if it contains specific strings and assign a value if true in a different column? Essentially, I'm trying to mimic a CASE WHEN LIKE THEN syntax in SQL but in pandas. Really new to python and pandas and would appreciate any help! Essentially, I want to search 'Source' for either video, audio, default, and if found, then Type would be video, audio, default accordingly. I hope this makes sense!
Source Type
video1393x2352_high video
audiowefxwrwf_low audio
default2325_none default
23234_audio audio
Use the str.extract method ... takes a regular expression as an argument ... returns matched group as a string ...
df['Type'] = df.Source.str.extract('(video|audio|default)')
For some case sensitivity you could add ...
df['Type'] = df.Source.str.lower().str.extract('(video|audio|default)')
Example, including a non match follows ...
In [24]: %paste
import pandas as pd
data = """
Source
video1393x2352_high
audiowefxwrwf_low
default2325_none
23234_audio
complete_crap
AUDIO_upper_case_test"""
from StringIO import StringIO # import from io for python 3
df = pd.read_csv(StringIO(data), header=0, index_col=None)
df['Type'] = df.Source.str.lower().str.extract('(video|audio|default)')
## -- End pasted text --
In [25]: df
Out[25]:
Source Type
0 video1393x2352_high video
1 audiowefxwrwf_low audio
2 default2325_none default
3 23234_audio audio
4 complete_crap NaN
5 AUDIO_upper_case_test audio
Try using numpy.where or pandas.DataFrame.where. Both take a boolean array and conditionally assign based on that.
In [4]: np.where([True, False, True], 3, 4)
Out[4]: array([3, 4, 3])
http://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html
You would construct the boolean array using str.contains, and then pass it to the where method.
Try something like:-
import re
input_values = ['video1393x2352_high', 'audiowefxwrwf_low', 'default2325_none', '23234_audio']
pattern = re.compile('audio|video|default')
res_dict = {}
for input_val in input_values:
type = pattern.findall(input_val)
if type:
res_dict[input_val] = type[0]
print res_dict #{'23234_audio': 'audio', 'audiowefxwrwf_low': 'audio', 'video1393x2352_high': 'video', 'default2325_none': 'default'}

Categories