ValueError: DataFrame constructor not properly called - python

I am trying to create a dataframe with Python, which works fine with the following command:
df_test2 = DataFrame(index = idx, data=(["-54350","2016-06-25T10:29:57.340Z","2016-06-25T10:29:57.340Z"]))
but, when I try to get the data from a variable instead of hard-coding it into the data argument; eg. :
r6 = ["-54350", "2016-06-25T10:29:57.340Z", "2016-06-25T10:29:57.340Z"]
df_test2 = DataFrame(index = idx, data=(r6))
I expect this is the same and it should work? But I get:
ValueError: DataFrame constructor not properly called!

Reason for the error:
It seems a string representation isn't satisfying enough for the DataFrame constructor
Fix/Solutions:
import ast
# convert the string representation to a dict
dict = ast.literal_eval(r6)
# and use it as the input
df_test2 = DataFrame(index = idx, data=(dict))
which will solve the error.

Related

How to fix TypeError: string indices must be integers

I'm passing dataframe from mapInPandas function in pyspark. so I need all values of ID column should be seperated by comma(,) like this 'H57R6HU87','A1924334','496A4806'
x1['ID'] looks like this
H57R6HU87
A1924334
496A4806'
Here is my code to get unique ID's, I am getting TypeError: string indices must be integers
# batch_iter= cust.toPandas()
for x1 in batch_iter:
IDs= ','.join(f"'{i}'" for i in x1['ID'].unique())
You probably don't need a loop, try:
batch_iter = cust.toPandas()
IDs = ','.join(f"'{i}'" for i in batch_iter['ID'].unique())
Or you can try using Spark functions only:
df2 = df.select(F.concat_ws(',', F.collect_set('ID')).alias('ID'))
If you want to use mapInPandas:
def pandas_func(iter):
for x1 in iter:
IDs = ','.join(f"'{i}'" for i in x1['ID'].unique())
yield pd.DataFrame({'ID': IDs}, index=[0])
df.mapInPandas(pandas_func)
# But I suspect you want to do this instead:
# df.repartition(1).mapInPandas(pandas_func)

Problem importing data from Excel to Treeview

I created a treeview to display the data imported from an excel file.
def afficher():
fichier = r"*.xlsx"
df = pd.read_excel(fichier)
for row in df:
refOF = row['refOF']
refP = row['refP']
refC = row['refC']
nbreP = row['nbreP']
dateFF = row['dateFF']
self.ordreF.insert("", 0, values=(refOF, refP, refC, nbreP, dateFF))
but I encounter the following error:
refOF = row['refOF']
TypeError: string indices must be integers
please tell me how I can solve this problem.
Another way is replacing the original for loop with the following:
for tup in df[['refOF', 'refP', 'refC', 'nbreP', 'dateFF']].itertuples(index=False, name=None):
self.ordreF.insert("", 0, values=tup)
It works because df.itertuples(index=False, name=None) returns a regular tuple without index in the assigned column order. The tuple can be fed into the values= argument directly.
With your loop you are actually not iterating over the rows, but over the column names. That is the reason for the error message, because row is the string with the colum name and if you use [] you need to specify an integer or an integer based slice, but not a string.
To make your code work, you would just need to modify your code a bit to iterate over the rows:
def afficher():
fichier = r"*.xlsx"
df = pd.read_excel(fichier)
for idx, row in df.iterrows():
refOF = row['refOF']
refP = row['refP']
refC = row['refC']
nbreP = row['nbreP']
dateFF = row['dateFF']
self.ordreF.insert("", 0, values=(refOF, refP, refC, nbreP, dateFF))

Problems transforming data in a dataframe

I've written the function (tested and working) below:
import pandas as pd
def ConvertStrDateToWeekId(strDate):
dateformat = '2016-7-15 22:44:09'
aDate = pd.to_datetime(strDate)
wk = aDate.isocalendar()[1]
yr = aDate.isocalendar()[0]
Format_4_5_4_date = str(yr) + str(wk)
return Format_4_5_4_date'
and from what I have seen on line I should be able to use it this way:
ml_poLines = result.value.select('PURCHASEORDERNUMBER', 'ITEMNUMBER', PRODUCTCOLORID', 'RECEIVINGWAREHOUSEID', ConvertStrDateToWeekId('CONFIRMEDDELIVERYDATE'))
However when I "show" my dataframe the "CONFIRMEDDELIVERYDATE" column is the original datetime string! NO errors are given.
I've also tried this:
ml_poLines['WeekId'] = (ConvertStrDateToWeekId(ml_poLines['CONFIRMEDDELIVERYDATE']))
and get the following error:
"ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions." which makes no sense to me.
I've also tried this with no success.
x = ml_poLines.toPandas();
x['testDates'] = ConvertStrDateToWeekId(x['CONFIRMEDDELIVERYDATE'])
ml_poLines2 = spark.createDataFrame(x)
ml_poLines2.show()
The above generates the following error:
AttributeError: 'Series' object has no attribute 'isocalendar'
What have I done wrong?
Your function ConvertStrDateToWeekId takes a string. But in the following line the argument of the function call is a series of strings:
x['testDates'] = ConvertStrDateToWeekId(x['CONFIRMEDDELIVERYDATE'])
A possible workaround for this error is to use the apply-function of pandas:
x['testDates'] = x['CONFIRMEDDELIVERYDATE'].apply(ConvertStrDateToWeekId)
But without more information about the kind of data you are processing it is hard to provide further help.
This was the work-around that I got to work:
`# convert the confirimedDeliveryDate to a WeekId
x= ml_poLines.toPandas();
x['WeekId'] = x[['ITEMNUMBER', 'CONFIRMEDDELIVERYDATE']].apply(lambda y:ConvertStrDateToWeekId(y[1]), axis=1)
ml_poLines = spark.createDataFrame(x)
ml_poLines.show()`
Not quite as clean as I would like.
Maybe someone else cam propose a cleaner solution.

How do I convert csv string to list in pandas?

I'm working with a csv file that has the following format:
"Id","Sequence"
3,"1,3,13,87,1053,28576,2141733,508147108,402135275365,1073376057490373,9700385489355970183,298434346895322960005291,31479360095907908092817694945,11474377948948020660089085281068730"
7,"1,2,1,5,5,1,11,16,7,1,23,44,30,9,1,47,112,104,48,11,1,95,272,320,200,70,13,1,191,640,912,720,340,96,15,1,383,1472,2464,2352,1400,532,126,17,1,767,3328,6400,7168,5152,2464,784,160,19,1,1535,7424"
8,"1,2,4,5,8,10,16,20,32,40,64,80,128,160,256,320,512,640,1024,1280,2048,2560,4096,5120,8192,10240,16384,20480,32768,40960,65536,81920,131072,163840,262144,327680,524288,655360,1048576,1310720,2097152"
11,"1,8,25,83,274,2275,132224,1060067,3312425,10997342,36304451,301432950,17519415551,140456757358,438889687625,1457125820233,4810267148324,39939263006825,2321287521544174,18610239435360217"
I'd like to read this into a data frame with the type of df['Id'] to be integer-like and the type of df['Sequence'] to be list-like.
I currently have the following kludgy code:
def clean(seq_string):
return list(map(int, seq_string.split(',')))
# Read data
training_data_file = "data/train.csv"
train = pd.read_csv(training_data_file)
train['Sequence'] = list(map(clean, train['Sequence'].values))
This appears to work, but I feel like the same could be achieved natively using pandas and numpy.
Does anyone have a recommendation?
You can specify a converter for the Sequence column:
converters: dict, default None
Dict of functions for converting
values in certain columns. Keys can either be integers or column
labels
train = pd.read_csv(training_data_file, converters={'Sequence': clean})
This also works, except that the Sequence is list of string instead of list of int:
df = pd.read_csv(training_data_file)
df['Sequence'] = df['Sequence'].str.split(',')
To convert each element to int:
df = pd.read_csv(training_data_file)
df['Sequence'] = df['Sequence'].str.split(',').apply(lambda s: list(map(int, s)))
An alternative solution is to use literal_eval from the ast module. literal_eval evaluates the string as input to the Python interpreter and should give you back the list as expected.
def clean(x):
return literal_eval(x)
train = pd.read_csv(training_data_file, converters={'Sequence': clean})

pandas read_json: "If using all scalar values, you must pass an index"

I have some difficulty in importing a JSON file with pandas.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
This is the error that I get:
ValueError: If using all scalar values, you must pass an index
The file structure is simplified like this:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
It is from the machine learning course of University of Washington on Coursera. You can find the file here.
Try
ser = pd.read_json('people_wiki_map_index_to_word.json', typ='series')
That file only contains key value pairs where values are scalars. You can convert it to a dataframe with ser.to_frame('count').
You can also do something like this:
import json
with open('people_wiki_map_index_to_word.json', 'r') as f:
data = json.load(f)
Now data is a dictionary. You can pass it to a dataframe constructor like this:
df = pd.DataFrame({'count': data})
You can do as #ayhan mention which will give you a column base format
Or you can enclose the object in [ ] (source) as shown below to give you a row format that will be convenient if you are loading multiple values and planing on using matrix for your machine learning models.
df = pd.DataFrame([data])
I think what is happening is that the data in
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
is being read as a string instead of a json
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
is actually
'{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}'
Since a string is a scalar, it wants you to load it as a json, you have to convert it to a dict which is exactly what the other response is doing
The best way is to do a json loads on the string to convert it to a dict and load it into pandas
myfile=f.read()
jsonData=json.loads(myfile)
df=pd.DataFrame(data)
{
"biennials": 522004,
"lb915": 116290
}
df = pd.read_json('values.json')
As pd.read_json expects a list
{
"biennials": [522004],
"lb915": [116290]
}
for a particular key, it returns an error saying
If using all scalar values, you must pass an index.
So you can resolve this by specifying 'typ' arg in pd.read_json
map_index_to_word = pd.read_json('Datasets/people_wiki_map_index_to_word.json', typ='dictionary')
For newer pandas, 0.19.0 and later, use the lines parameter, set it to True.
The file is read as a json object per line.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json', lines=True)
If fixed the following errors I encountered especially when some of the json files have only one value:
ValueError: If using all scalar values, you must pass an index
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
ValueError: Trailing data
For example
cat values.json
{
name: "Snow",
age: "31"
}
df = pd.read_json('values.json')
Chances are you might end up with this
Error: if using all scalar values, you must pass an index
Pandas looks up for a list or dictionary in the value. Something like
cat values.json
{
name: ["Snow"],
age: ["31"]
}
So try doing this. Later on to convert to html tohtml()
df = pd.DataFrame([pd.read_json(report_file, typ='series')])
result = df.to_html()
I solved this by converting it into an array like so
[{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}]

Categories