How do I convert a CSV string to a list in pandas? (python)
I'm working with a csv file that has the following format:
"Id","Sequence"
3,"1,3,13,87,1053,28576,2141733,508147108,402135275365,1073376057490373,9700385489355970183,298434346895322960005291,31479360095907908092817694945,11474377948948020660089085281068730"
7,"1,2,1,5,5,1,11,16,7,1,23,44,30,9,1,47,112,104,48,11,1,95,272,320,200,70,13,1,191,640,912,720,340,96,15,1,383,1472,2464,2352,1400,532,126,17,1,767,3328,6400,7168,5152,2464,784,160,19,1,1535,7424"
8,"1,2,4,5,8,10,16,20,32,40,64,80,128,160,256,320,512,640,1024,1280,2048,2560,4096,5120,8192,10240,16384,20480,32768,40960,65536,81920,131072,163840,262144,327680,524288,655360,1048576,1310720,2097152"
11,"1,8,25,83,274,2275,132224,1060067,3312425,10997342,36304451,301432950,17519415551,140456757358,438889687625,1457125820233,4810267148324,39939263006825,2321287521544174,18610239435360217"
I'd like to read this into a DataFrame so that df['Id'] has an integer type and df['Sequence'] holds a list of integers.
I currently have the following kludgy code:
import pandas as pd

def clean(seq_string):
    return list(map(int, seq_string.split(',')))

# Read data
training_data_file = "data/train.csv"
train = pd.read_csv(training_data_file)
train['Sequence'] = list(map(clean, train['Sequence'].values))
This appears to work, but I feel like the same could be achieved natively using pandas and numpy.
Does anyone have a recommendation?
You can specify a converter for the Sequence column:
converters : dict, default None
    Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
train = pd.read_csv(training_data_file, converters={'Sequence': clean})
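If you want to confirm the converter did its job, a couple of quick checks on the train frame from the line above:
print(train['Id'].dtype)                   # int64
print(type(train['Sequence'].iloc[0]))     # <class 'list'>
print(type(train['Sequence'].iloc[0][0]))  # <class 'int'>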
This also works, except that Sequence ends up as a list of strings instead of a list of ints:
df = pd.read_csv(training_data_file)
df['Sequence'] = df['Sequence'].str.split(',')
To convert each element to int:
df = pd.read_csv(training_data_file)
df['Sequence'] = df['Sequence'].str.split(',').apply(lambda s: list(map(int, s)))
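A quick sanity check against the sample data above (the first row is Id 3):
print(df['Sequence'].iloc[0][:5])        # [1, 3, 13, 87, 1053]
print(type(df['Sequence'].iloc[0][0]))   # <class 'int'>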
An alternative solution is to use literal_eval from the ast module. literal_eval safely parses the string as a Python literal, so a comma-separated sequence like "1,3,13" comes back as a tuple of ints (wrap it in list() if you specifically need a list).
from ast import literal_eval

def clean(x):
    return literal_eval(x)

train = pd.read_csv(training_data_file, converters={'Sequence': clean})
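A quick standalone illustration of what literal_eval returns for one of these comma-separated strings (no CSV file needed):
from ast import literal_eval

parsed = literal_eval("1,3,13,87")
print(parsed)        # (1, 3, 13, 87)
print(type(parsed))  # <class 'tuple'>
print(list(parsed))  # [1, 3, 13, 87]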
Related
How to fix TypeError: string indices must be integers
I'm passing a dataframe through the mapInPandas function in pyspark, and I need all values of the ID column separated by commas, like this: 'H57R6HU87','A1924334','496A4806'. x1['ID'] looks like this: H57R6HU87 A1924334 496A4806. Here is my code to get the unique IDs, but I am getting TypeError: string indices must be integers:
batch_iter = cust.toPandas()
for x1 in batch_iter:
    IDs = ','.join(f"'{i}'" for i in x1['ID'].unique())
You probably don't need a loop; try:
batch_iter = cust.toPandas()
IDs = ','.join(f"'{i}'" for i in batch_iter['ID'].unique())
Or you can try using Spark functions only:
df2 = df.select(F.concat_ws(',', F.collect_set('ID')).alias('ID'))
If you want to use mapInPandas:
def pandas_func(iter):
    for x1 in iter:
        IDs = ','.join(f"'{i}'" for i in x1['ID'].unique())
        yield pd.DataFrame({'ID': IDs}, index=[0])

df.mapInPandas(pandas_func, schema="ID string")
# But I suspect you want to do this instead:
# df.repartition(1).mapInPandas(pandas_func, schema="ID string")
What is the best way to convert a string in a pandas dataframe to a list?
Basically I have a dataframe with lists that have been read in as strings, and I would like to convert them back to lists. Below shows what I am currently doing, but I'm still learning and feel like there must be a better (more efficient/Pythonic) way to go about this. Any help/constructive criticism would be much appreciated!
import pandas as pd
import ast

df = pd.DataFrame(data=['[-1,0]', '[1]', '[1,2]'], columns=['example'])
type(df['example'][0])
>> str

n = df.shape[0]
temp = []
temp2 = []
for i in range(n):
    temp = (ast.literal_eval(df['example'][i]))
    temp2.append(temp)
df['new_col_lists'] = temp2

type(df['new_col_lists'][0])
>> list
Maybe you could use a map:
df['example'] = df['example'].map(ast.literal_eval)
With pandas, there is almost always a way to avoid the for loop.
You can use .apply. Ex:
import pandas as pd
import ast

df = pd.DataFrame(data=['[-1,0]', '[1]', '[1,2]'], columns=['example'])
df['example'] = df['example'].apply(ast.literal_eval)
print(type(df['example'][0]))
Output:
<type 'list'>
You could use apply with a lambda which splits and converts your strings:
df['new_col_lists'] = df['example'].apply(lambda s: [int(v.strip()) for v in s[1:-1].split(',')])
Use a float cast instead of int if needed.
ValueError: DataFrame constructor not properly called
I am trying to create a dataframe with Python, which works fine with the following command:
df_test2 = DataFrame(index=idx, data=(["-54350","2016-06-25T10:29:57.340Z","2016-06-25T10:29:57.340Z"]))
but when I try to get the data from a variable instead of hard-coding it into the data argument, e.g.:
r6 = ["-54350", "2016-06-25T10:29:57.340Z", "2016-06-25T10:29:57.340Z"]
df_test2 = DataFrame(index=idx, data=(r6))
I expect this to be the same and to work, but I get:
ValueError: DataFrame constructor not properly called!
Reason for the error: the DataFrame constructor is being handed a string representation of the data rather than the data itself; a plain string isn't an acceptable data argument.
Fix/solution: parse the string back into a Python object first:
import ast

# convert the string representation back into a Python list
r6_parsed = ast.literal_eval(r6)
# and use that as the input
df_test2 = DataFrame(index=idx, data=r6_parsed)
which will resolve the error.
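A minimal, self-contained sketch of that fix; it assumes r6 arrived as a string (e.g. read from a file or an API) and uses a made-up idx purely for illustration:
import ast
from pandas import DataFrame

idx = ['value', 'start', 'end']   # hypothetical index, just for the example
r6 = '["-54350", "2016-06-25T10:29:57.340Z", "2016-06-25T10:29:57.340Z"]'   # a string, not a list

r6_parsed = ast.literal_eval(r6)                 # back to a real Python list
df_test2 = DataFrame(index=idx, data=r6_parsed)
print(df_test2)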
Does pandas support reading a `set` parameter using read_csv?
I saved a set column using to_csv. The csv file looks like this:
1,59,"set([17122, 196, 26405, 13032, 39657, 12427, 25133, 35951, 38928, 26088, 10258, 49235, 10326, 13176, 30450, 41787, 14084, 46149])",18,19.0,11,5.36363649368
Can I use read_csv to get a set type back, rather than str?
users = pd.read_csv(DATA_PATH + "users_match.csv", dtype={ })
The answer is yes. Your solution
users = pd.read_csv(DATA_PATH + "users_match.csv", header=None)
will already return column 2 as a string, as long as you have double quotes around set([...]). Then use
users[2].apply(lambda x: eval(x))
to convert it back to a set.
To convert the DataFrame's str object (the string starting with the characters "set") into a built-in Python set object, here is one way:
>>> import pandas as pd
>>> df = pd.read_csv('users_match.csv', header=None)
>>> type(df[2][0])
str
>>> df.set_value(0, 2, eval(df[2][0]))
>>> type(df[2][0])
set
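If you'd rather not call eval on file contents, here is a minimal eval-free sketch; it assumes every cell in that column literally looks like "set([...])":
def parse_set(cell):
    # strip the "set([" prefix and "])" suffix, then parse the numbers
    inner = cell[len("set(["):-len("])")]
    return {int(x) for x in inner.split(",") if x.strip()}

users[2] = users[2].apply(parse_set)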
pandas read_json: "If using all scalar values, you must pass an index"
I have some difficulty in importing a JSON file with pandas.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
This is the error that I get:
ValueError: If using all scalar values, you must pass an index
The file structure is simplified like this:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
It is from the machine learning course of the University of Washington on Coursera. You can find the file here.
Try
ser = pd.read_json('people_wiki_map_index_to_word.json', typ='series')
That file only contains key/value pairs where the values are scalars. You can convert it to a dataframe with ser.to_frame('count').
You can also do something like this:
import json
with open('people_wiki_map_index_to_word.json', 'r') as f:
    data = json.load(f)
Now data is a dictionary. You can pass it to a dataframe constructor like this:
df = pd.DataFrame({'count': data})
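A self-contained sketch of the typ='series' route (it feeds a shortened inline sample through io.StringIO so it runs without the course file):
import io
import pandas as pd

raw = '{"biennials": 522004, "lb915": 116290, "shatzky": 127647}'
ser = pd.read_json(io.StringIO(raw), typ='series')   # keys become the index
df = ser.to_frame('count')                           # single 'count' column
print(df)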
You can do as @ayhan mentioned, which will give you a column-based format. Or you can enclose the object in [ ] (source), as shown below, to give you a row format that is convenient if you are loading multiple values and planning on using matrices for your machine learning models.
df = pd.DataFrame([data])
I think what is happening is that the data in
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
is being read as a string instead of as JSON:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
is actually
'{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}'
Since a string is a scalar, it wants you to load it as JSON; you have to convert it to a dict, which is exactly what the other response is doing. The best way is to do json.loads on the string to convert it to a dict and load it into pandas:
myfile = f.read()
jsonData = json.loads(myfile)
df = pd.DataFrame([jsonData])   # wrap in a list so the scalar values get a row index
{ "biennials": 522004, "lb915": 116290 } df = pd.read_json('values.json') As pd.read_json expects a list { "biennials": [522004], "lb915": [116290] } for a particular key, it returns an error saying If using all scalar values, you must pass an index. So you can resolve this by specifying 'typ' arg in pd.read_json map_index_to_word = pd.read_json('Datasets/people_wiki_map_index_to_word.json', typ='dictionary')
For newer pandas, 0.19.0 and later, use the lines parameter and set it to True. The file is then read as one JSON object per line:
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json', lines=True)
It fixed the following errors I encountered, especially when some of the JSON files have only one value:
ValueError: If using all scalar values, you must pass an index
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
ValueError: Trailing data
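A quick self-contained illustration of lines=True (the sample records are made up for the example and read via io.StringIO, so no file is needed):
import io
import pandas as pd

jsonl = '{"word": "biennials", "count": 522004}\n{"word": "lb915", "count": 116290}\n'
df = pd.read_json(io.StringIO(jsonl), lines=True)
print(df)
#         word   count
# 0  biennials  522004
# 1      lb915  116290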
For example:
cat values.json
{ "name": "Snow", "age": "31" }
df = pd.read_json('values.json')
Chances are you might end up with this error: if using all scalar values, you must pass an index.
Pandas looks for a list or dictionary in the values, something like:
cat values.json
{ "name": ["Snow"], "age": ["31"] }
So try doing this instead. Later on, to convert to HTML, use to_html():
df = pd.DataFrame([pd.read_json(report_file, typ='series')])
result = df.to_html()
I solved this by converting it into an array, like so:
[{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}]
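To see why wrapping in an array works, here is a small self-contained sketch (it uses io.StringIO with a shortened sample instead of editing the file on disk):
import io
import pandas as pd

wrapped = '[{"biennials": 522004, "lb915": 116290, "shatzky": 127647}]'
df = pd.read_json(io.StringIO(wrapped))
print(df.shape)   # (1, 3) -- one row, one column per key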