How do I convert a CSV string to a list in pandas? (python)
I'm working with a csv file that has the following format:
"Id","Sequence"
3,"1,3,13,87,1053,28576,2141733,508147108,402135275365,1073376057490373,9700385489355970183,298434346895322960005291,31479360095907908092817694945,11474377948948020660089085281068730"
7,"1,2,1,5,5,1,11,16,7,1,23,44,30,9,1,47,112,104,48,11,1,95,272,320,200,70,13,1,191,640,912,720,340,96,15,1,383,1472,2464,2352,1400,532,126,17,1,767,3328,6400,7168,5152,2464,784,160,19,1,1535,7424"
8,"1,2,4,5,8,10,16,20,32,40,64,80,128,160,256,320,512,640,1024,1280,2048,2560,4096,5120,8192,10240,16384,20480,32768,40960,65536,81920,131072,163840,262144,327680,524288,655360,1048576,1310720,2097152"
11,"1,8,25,83,274,2275,132224,1060067,3312425,10997342,36304451,301432950,17519415551,140456757358,438889687625,1457125820233,4810267148324,39939263006825,2321287521544174,18610239435360217"
I'd like to read this into a DataFrame so that df['Id'] has an integer type and df['Sequence'] holds a list of integers.
I currently have the following kludgy code:
import pandas as pd

def clean(seq_string):
    return list(map(int, seq_string.split(',')))

# Read data
training_data_file = "data/train.csv"
train = pd.read_csv(training_data_file)
train['Sequence'] = list(map(clean, train['Sequence'].values))
This appears to work, but I feel like the same could be achieved natively using pandas and numpy.
Does anyone have a recommendation?
You can specify a converter for the Sequence column:
converters : dict, default None
    Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
train = pd.read_csv(training_data_file, converters={'Sequence': clean})
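If you want to confirm the converter did its job, a couple of quick checks on the train frame from the line above:
print(train['Id'].dtype)                   # int64
print(type(train['Sequence'].iloc[0]))     # <class 'list'>
print(type(train['Sequence'].iloc[0][0]))  # <class 'int'>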
This also works, except that Sequence ends up as a list of strings instead of a list of ints:
df = pd.read_csv(training_data_file)
df['Sequence'] = df['Sequence'].str.split(',')
To convert each element to int:
df = pd.read_csv(training_data_file)
df['Sequence'] = df['Sequence'].str.split(',').apply(lambda s: list(map(int, s)))
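A quick sanity check against the sample data above (the first row is Id 3):
print(df['Sequence'].iloc[0][:5])        # [1, 3, 13, 87, 1053]
print(type(df['Sequence'].iloc[0][0]))   # <class 'int'>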
An alternative solution is to use literal_eval from the ast module. literal_eval safely parses the string as a Python literal, so a comma-separated sequence like "1,3,13" comes back as a tuple of ints (wrap it in list() if you specifically need a list).
from ast import literal_eval

def clean(x):
    return literal_eval(x)

train = pd.read_csv(training_data_file, converters={'Sequence': clean})
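A quick standalone illustration of what literal_eval returns for one of these comma-separated strings (no CSV file needed):
from ast import literal_eval

parsed = literal_eval("1,3,13,87")
print(parsed)        # (1, 3, 13, 87)
print(type(parsed))  # <class 'tuple'>
print(list(parsed))  # [1, 3, 13, 87]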
Related
How to fix TypeError: string indices must be integers
I'm passing a dataframe through the mapInPandas function in pyspark, and I need all values of the ID column separated by commas, like this: 'H57R6HU87','A1924334','496A4806'. x1['ID'] looks like this: H57R6HU87 A1924334 496A4806. Here is my code to get the unique IDs, but I am getting TypeError: string indices must be integers:
batch_iter = cust.toPandas()
for x1 in batch_iter:
    IDs = ','.join(f"'{i}'" for i in x1['ID'].unique())
You probably don't need a loop; try:
batch_iter = cust.toPandas()
IDs = ','.join(f"'{i}'" for i in batch_iter['ID'].unique())
Or you can try using Spark functions only:
df2 = df.select(F.concat_ws(',', F.collect_set('ID')).alias('ID'))
If you want to use mapInPandas:
def pandas_func(iter):
    for x1 in iter:
        IDs = ','.join(f"'{i}'" for i in x1['ID'].unique())
        yield pd.DataFrame({'ID': IDs}, index=[0])

df.mapInPandas(pandas_func, schema="ID string")
# But I suspect you want to do this instead:
# df.repartition(1).mapInPandas(pandas_func, schema="ID string")
What is the best way to convert a string in a pandas dataframe to a list?
Basically I have a dataframe with lists that have been read in as strings, and I would like to convert them back to lists. Below shows what I am currently doing, but I'm still learning and feel like there must be a better (more efficient/Pythonic) way to go about this. Any help/constructive criticism would be much appreciated!
import pandas as pd
import ast

df = pd.DataFrame(data=['[-1,0]', '[1]', '[1,2]'], columns=['example'])
type(df['example'][0])
>> str

n = df.shape[0]
temp = []
temp2 = []
for i in range(n):
    temp = (ast.literal_eval(df['example'][i]))
    temp2.append(temp)
df['new_col_lists'] = temp2

type(df['new_col_lists'][0])
>> list
Maybe you could use a map:
df['example'] = df['example'].map(ast.literal_eval)
With pandas, there is almost always a way to avoid the for loop.
You can use .apply. Ex:
import pandas as pd
import ast

df = pd.DataFrame(data=['[-1,0]', '[1]', '[1,2]'], columns=['example'])
df['example'] = df['example'].apply(ast.literal_eval)
print(type(df['example'][0]))
Output:
<type 'list'>
You could use apply with a lambda which splits and converts your strings:
df['new_col_lists'] = df['example'].apply(lambda s: [int(v.strip()) for v in s[1:-1].split(',')])
Use a float cast instead of int if needed.
ValueError: DataFrame constructor not properly called
I am trying to create a dataframe with Python, which works fine with the following command:
df_test2 = DataFrame(index=idx, data=(["-54350","2016-06-25T10:29:57.340Z","2016-06-25T10:29:57.340Z"]))
but when I try to get the data from a variable instead of hard-coding it into the data argument, e.g.:
r6 = ["-54350", "2016-06-25T10:29:57.340Z", "2016-06-25T10:29:57.340Z"]
df_test2 = DataFrame(index=idx, data=(r6))
I expect this to be the same and to work, but I get:
ValueError: DataFrame constructor not properly called!
Reason for the error: the DataFrame constructor is being handed a string representation of the data rather than the data itself; a plain string isn't an acceptable data argument.
Fix/solution: parse the string back into a Python object first:
import ast

# convert the string representation back into a Python list
r6_parsed = ast.literal_eval(r6)
# and use that as the input
df_test2 = DataFrame(index=idx, data=r6_parsed)
which will resolve the error.
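A minimal, self-contained sketch of that fix; it assumes r6 arrived as a string (e.g. read from a file or an API) and uses a made-up idx purely for illustration:
import ast
from pandas import DataFrame

idx = ['value', 'start', 'end']   # hypothetical index, just for the example
r6 = '["-54350", "2016-06-25T10:29:57.340Z", "2016-06-25T10:29:57.340Z"]'   # a string, not a list

r6_parsed = ast.literal_eval(r6)                 # back to a real Python list
df_test2 = DataFrame(index=idx, data=r6_parsed)
print(df_test2)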
Does pandas support reading a `set` parameter using read_csv?
I saved a set column using to_csv. The csv file looks like this:
1,59,"set([17122, 196, 26405, 13032, 39657, 12427, 25133, 35951, 38928, 26088, 10258, 49235, 10326, 13176, 30450, 41787, 14084, 46149])",18,19.0,11,5.36363649368
Can I use read_csv to get a set type back, rather than str?
users = pd.read_csv(DATA_PATH + "users_match.csv", dtype={ })
The answer is yes. Your solution
users = pd.read_csv(DATA_PATH + "users_match.csv", header=None)
will already return column 2 as a string, as long as you have double quotes around set([...]). Then use
users[2].apply(lambda x: eval(x))
to convert it back to a set.
To convert the DataFrame's str object (the string starting with the characters "set") into a built-in Python set object, here is one way:
>>> import pandas as pd
>>> df = pd.read_csv('users_match.csv', header=None)
>>> type(df[2][0])
str
>>> df.set_value(0, 2, eval(df[2][0]))
>>> type(df[2][0])
set
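If you'd rather not call eval on file contents, here is a minimal eval-free sketch; it assumes every cell in that column literally looks like "set([...])":
def parse_set(cell):
    # strip the "set([" prefix and "])" suffix, then parse the numbers
    inner = cell[len("set(["):-len("])")]
    return {int(x) for x in inner.split(",") if x.strip()}

users[2] = users[2].apply(parse_set)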
pandas read_json: "If using all scalar values, you must pass an index"
I have some difficulty in importing a JSON file with pandas.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
This is the error that I get:
ValueError: If using all scalar values, you must pass an index
The file structure is simplified like this:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
It is from the machine learning course of the University of Washington on Coursera. You can find the file here.
Try
ser = pd.read_json('people_wiki_map_index_to_word.json', typ='series')
That file only contains key/value pairs where the values are scalars. You can convert it to a dataframe with ser.to_frame('count').
You can also do something like this:
import json
with open('people_wiki_map_index_to_word.json', 'r') as f:
    data = json.load(f)
Now data is a dictionary. You can pass it to a dataframe constructor like this:
df = pd.DataFrame({'count': data})
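A self-contained sketch of the typ='series' route (it feeds a shortened inline sample through io.StringIO so it runs without the course file):
import io
import pandas as pd

raw = '{"biennials": 522004, "lb915": 116290, "shatzky": 127647}'
ser = pd.read_json(io.StringIO(raw), typ='series')   # keys become the index
df = ser.to_frame('count')                           # single 'count' column
print(df)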
You can do as @ayhan mentioned, which will give you a column-based format. Or you can enclose the object in [ ] (source), as shown below, to give you a row format that is convenient if you are loading multiple values and planning on using matrices for your machine learning models.
df = pd.DataFrame([data])
I think what is happening is that the data in
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
is being read as a string instead of as JSON:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
is actually
'{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}'
Since a string is a scalar, it wants you to load it as JSON; you have to convert it to a dict, which is exactly what the other response is doing. The best way is to do json.loads on the string to convert it to a dict and load it into pandas:
myfile = f.read()
jsonData = json.loads(myfile)
df = pd.DataFrame([jsonData])   # wrap in a list so the scalar values get a row index
{ "biennials": 522004, "lb915": 116290 } df = pd.read_json('values.json') As pd.read_json expects a list { "biennials": [522004], "lb915": [116290] } for a particular key, it returns an error saying If using all scalar values, you must pass an index. So you can resolve this by specifying 'typ' arg in pd.read_json map_index_to_word = pd.read_json('Datasets/people_wiki_map_index_to_word.json', typ='dictionary')
For newer pandas, 0.19.0 and later, use the lines parameter and set it to True. The file is then read as one JSON object per line:
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json', lines=True)
It fixed the following errors I encountered, especially when some of the JSON files have only one value:
ValueError: If using all scalar values, you must pass an index
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
ValueError: Trailing data
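A quick self-contained illustration of lines=True (the sample records are made up for the example and read via io.StringIO, so no file is needed):
import io
import pandas as pd

jsonl = '{"word": "biennials", "count": 522004}\n{"word": "lb915", "count": 116290}\n'
df = pd.read_json(io.StringIO(jsonl), lines=True)
print(df)
#         word   count
# 0  biennials  522004
# 1      lb915  116290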
For example:
cat values.json
{ "name": "Snow", "age": "31" }
df = pd.read_json('values.json')
Chances are you might end up with this error: if using all scalar values, you must pass an index.
Pandas looks for a list or dictionary in the values, something like:
cat values.json
{ "name": ["Snow"], "age": ["31"] }
So try doing this instead. Later on, to convert to HTML, use to_html():
df = pd.DataFrame([pd.read_json(report_file, typ='series')])
result = df.to_html()
I solved this by converting it into an array, like so:
[{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}]
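To see why wrapping in an array works, here is a small self-contained sketch (it uses io.StringIO with a shortened sample instead of editing the file on disk):
import io
import pandas as pd

wrapped = '[{"biennials": 522004, "lb915": 116290, "shatzky": 127647}]'
df = pd.read_json(io.StringIO(wrapped))
print(df.shape)   # (1, 3) -- one row, one column per key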