Problems transforming data in a dataframe - python

I've written the function (tested and working) below:
import pandas as pd

def ConvertStrDateToWeekId(strDate):
    dateformat = '2016-7-15 22:44:09'  # example of the expected input format
    aDate = pd.to_datetime(strDate)
    wk = aDate.isocalendar()[1]
    yr = aDate.isocalendar()[0]
    Format_4_5_4_date = str(yr) + str(wk)
    return Format_4_5_4_date
and from what I have seen online I should be able to use it this way:
ml_poLines = result.value.select('PURCHASEORDERNUMBER', 'ITEMNUMBER', 'PRODUCTCOLORID', 'RECEIVINGWAREHOUSEID', ConvertStrDateToWeekId('CONFIRMEDDELIVERYDATE'))
However, when I "show" my dataframe, the "CONFIRMEDDELIVERYDATE" column still contains the original datetime string! No errors are given.
I've also tried this:
ml_poLines['WeekId'] = (ConvertStrDateToWeekId(ml_poLines['CONFIRMEDDELIVERYDATE']))
and get the following error:
"ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions." which makes no sense to me.
I've also tried this with no success.
x = ml_poLines.toPandas();
x['testDates'] = ConvertStrDateToWeekId(x['CONFIRMEDDELIVERYDATE'])
ml_poLines2 = spark.createDataFrame(x)
ml_poLines2.show()
The above generates the following error:
AttributeError: 'Series' object has no attribute 'isocalendar'
What have I done wrong?

Your function ConvertStrDateToWeekId takes a string. But in the following line the argument of the function call is a series of strings:
x['testDates'] = ConvertStrDateToWeekId(x['CONFIRMEDDELIVERYDATE'])
A possible workaround for this error is to use the pandas apply function:
x['testDates'] = x['CONFIRMEDDELIVERYDATE'].apply(ConvertStrDateToWeekId)
But without more information about the kind of data you are processing it is hard to provide further help.

This was the work-around that I got to work:
`# convert the CONFIRMEDDELIVERYDATE to a WeekId
x = ml_poLines.toPandas()
x['WeekId'] = x[['ITEMNUMBER', 'CONFIRMEDDELIVERYDATE']].apply(lambda y: ConvertStrDateToWeekId(y[1]), axis=1)
ml_poLines = spark.createDataFrame(x)
ml_poLines.show()`
Not quite as clean as I would like.
Maybe someone else can propose a cleaner solution.
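A cleaner alternative might be to keep everything in Spark and wrap the function in a UDF, so no toPandas() round trip is needed. This is only a sketch, assuming result.value is the original Spark DataFrame from the question:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import pandas as pd

def ConvertStrDateToWeekId(strDate):
    aDate = pd.to_datetime(strDate)
    return str(aDate.isocalendar()[0]) + str(aDate.isocalendar()[1])

# register the Python function as a Spark UDF that returns a string
convert_week_udf = udf(ConvertStrDateToWeekId, StringType())

ml_poLines = result.value.select(
    'PURCHASEORDERNUMBER', 'ITEMNUMBER', 'PRODUCTCOLORID', 'RECEIVINGWAREHOUSEID',
    convert_week_udf('CONFIRMEDDELIVERYDATE').alias('WeekId'))
ml_poLines.show()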

Related

PySpark: Converting config string items for dataframe operation

Input:
source_dataframe = spark.createDataFrame(
    [
        (1, "1", "2020-01-01", 10),
        (1, "2", "2020-01-01", 20),
        (1, "2", "2020-02-01", 30)
    ],
    ("country_code", "cust_id", "day", "value")
)
Config:
input_config = """
[ {
"source":"source_dataframe",
"opearation":"max",
"group":["country_code", "cust_id"]
}
]
"""
import json
config_dict = json.loads(input_config)
print(config_dict)
Read from the config and apply the operation on the input dataframe. Here I have hardcoded the dataframe (source_dataframe) and the operation (max); this works fine:
from pyspark.sql.functions import max

for each in config_dict:
    result = source_dataframe.groupBy(["country_code", "cust_id"]).agg(max("value"))
    result.show()
However, instead of hardcoding, if I try to read the dataframe from the config dynamically and apply the operation on the input, I run into different errors. This could be because, when read from the config, they are strings. How do I convert the string objects so that they work?
Error: 'str' object has no attribute 'groupBy'
result = each['source'].groupBy(["country_code", "cust_id"]).agg(max("value"))
Error: TypeError: 'str' object is not callable
result = source_dataframe.groupBy(["country_code", "cust_id"]).agg(each['opearation']("value"))
This section where I read groupBy dynamically works fine.
result = source_dataframe.groupBy(each["group"]).agg(max("value"))
I tried looking at other posts but could not figure out a solution. Can anyone please help?
Maybe you should evaluate the string, which would grant you access to the underlying dataframe.
result = eval(each['source']).groupBy(["country_code", "cust_id"]).agg(max("value"))
I can't verify this since I got an error from your first part.
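If eval feels too permissive, a more explicit approach (only a sketch based on the config shown above; the lookup dict is made up) is to resolve the dataframe from a dictionary and the aggregation function from pyspark.sql.functions by name:
import pyspark.sql.functions as F

# hypothetical registry mapping config names to actual dataframes
dataframes = {"source_dataframe": source_dataframe}

for each in config_dict:
    df = dataframes[each["source"]]            # "source_dataframe" -> the DataFrame object
    agg_func = getattr(F, each["opearation"])  # "max" -> F.max (key spelled as in the config)
    result = df.groupBy(each["group"]).agg(agg_func("value"))
    result.show()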

Expected unicode, got pandas._libs.properties.CachedProperty

I'm trying to add empty columns to my dataset on Colab, but it gives me this error. When I run it on my local machine it works perfectly fine. Does anybody know a possible solution for this?
My code:
dataframe["Comp"] = ''
dataframe["Negative"] = ''
dataframe["Neutral"] = ''
dataframe["Positive"] = ''
dataframe
Error message
TypeError: Expected unicode, got pandas._libs.properties.CachedProperty
I ran into a similar issue today:
"Expected unicode, got pandas._libs.properties.CachedProperty"
My dataframe (called df) has a time index. When I add a new column to it and fill it with numpy array data, it raises this error. I tried setting it with df.index or df.index.values; it always raises this error.
Finally, I solved it in 3 steps:
df = df.reset_index()
df['new_column'] = new_column_data # it is np.array format
df = df.set_index('original_index_name')
This question is the same as https://stackoverflow.com/a/67997139/16240186, and there's a simple way to solve it: df = df.asfreq('H')  # freq can be 'min', 'D', 'M', 'S', '5min', etc.
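For context, a minimal sketch of how the two fixes fit together (the column names and data here are made up, and whether the original error reproduces depends on the pandas version):
import numpy as np
import pandas as pd

idx = pd.date_range("2021-01-01", periods=4, freq="H")
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=idx)

# option 1: normalize the index frequency before assigning the array
df = df.asfreq("H")
df["new_column"] = np.array([10, 20, 30, 40])

# option 2: drop the time index, assign, then restore it
df = df.reset_index()
df["other_column"] = np.array([5, 6, 7, 8])
df = df.set_index("index")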

Regex problems: TypeError: expected string or bytes-like object

I am trying the code:
`s='{"mail":vip#a.com,"type":"a","r_id":"1312","level":307},{"mail":vipx#a.com,"type":"b","r_id":"1111"}'`
data_raw=re.split(r'[\{\}]',s)
data_raw=data_raw[1::2]
data=pd.DataFrame(data_raw)
data[0]=str(data[0])
data['r_id']=data[0].apply(lambda x:re.search(r'(r_id)',data[0]))
data['level']=data[0].apply(lambda x:re.search(r'(level)',data[0]))
print(data)
I wish I could get the result:
r_id level
1312 307
1111 NAN
But it shows the error: expected string or bytes-like object
So how can I use re.search in pandas, or how can I get this result?
My two cents...
import re
pattern = re.compile(r'^.*?id\":\"(\d+)\",\"level\":(\d+).*id\":\"(\d+).*$')
string = r'{"mail":vip#a.com,"type":"a","r_id":"1312","level":307},{"mail":vipx#a.com,"type":"b","r_id":"1111"}'
data = pattern.findall(string)
data
Which returns an array:
[('1312', '307', '1111')]
And you can access items with, for example:
data[0][2]
Regex demo: https://regex101.com/r/Inv4gp/1
The below works for me. The type problem arises because you cannot change the type of all the rows like that; you would need a lambda function for that too.
There is an additional problem: the regex and the exception-case handling won't work like that. I propose a solution for this, but you might want to consider a different regex if you want this to work for other columns.
I'm very novice with regex, so there might be a more general-purpose solution for your problem.
import re
import pandas as pd
import numpy as np
s='{"mail":vip#a.com,"type":"a","r_id":"1312","level":307},{"mail":vipx#a.com,"type":"b","r_id":"1111"}'
data_raw=re.split(r'[\{\}]',s)
data_raw=data_raw[1::2]
data=pd.DataFrame(data_raw)
# This is a regex wrapper which gets the row of our pandas dataframes and the columns that we want.
def regex_wrapper(row, column):
    match = re.search(r'"' + column + r'":"?(\d+)"?', str(row))
    if match:
        return match.group(1)
    else:
        return np.nan

data['r_id'] = data[0].apply(lambda row: regex_wrapper(row, "r_id"))
data['level'] = data[0].apply(lambda row: regex_wrapper(row, "level"))
del data[0]
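As a possibly more compact alternative to the two apply calls above (a sketch, only checked against the sample string in the question), pandas' str.extract applies the same kind of pattern column-wise and fills NaN where there is no match:
data['r_id'] = data[0].astype(str).str.extract(r'"r_id":"?(\d+)"?', expand=False)
data['level'] = data[0].astype(str).str.extract(r'"level":"?(\d+)"?', expand=False)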

ValueError: DataFrame constructor not properly called

I am trying to create a dataframe with Python, which works fine with the following command:
df_test2 = DataFrame(index = idx, data=(["-54350","2016-06-25T10:29:57.340Z","2016-06-25T10:29:57.340Z"]))
but when I try to get the data from a variable instead of hard-coding it into the data argument, e.g.:
r6 = ["-54350", "2016-06-25T10:29:57.340Z", "2016-06-25T10:29:57.340Z"]
df_test2 = DataFrame(index = idx, data=(r6))
I expected this to be the same and that it should work, but instead I get:
ValueError: DataFrame constructor not properly called!
Reason for the error:
The error appears when r6 actually holds the string representation of the list (for example, read from a file or another text source) rather than a real list; the DataFrame constructor cannot interpret a bare string as data.
Fix/Solution:
import ast
# convert the string representation back into a real list
r6_list = ast.literal_eval(r6)
# and use it as the input
df_test2 = DataFrame(index=idx, data=r6_list)
which will solve the error.
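For illustration, a minimal sketch of both cases (the idx values here are made up):
import ast
from pandas import DataFrame

idx = [0, 1, 2]

# works: data is a real list
r6 = ["-54350", "2016-06-25T10:29:57.340Z", "2016-06-25T10:29:57.340Z"]
df_ok = DataFrame(index=idx, data=r6)

# a plain string would raise "DataFrame constructor not properly called!"
r6_str = '["-54350", "2016-06-25T10:29:57.340Z", "2016-06-25T10:29:57.340Z"]'
# DataFrame(index=idx, data=r6_str)   # -> ValueError

# converting it back into a list fixes it
df_fixed = DataFrame(index=idx, data=ast.literal_eval(r6_str))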

How do I convert csv string to list in pandas?

I'm working with a csv file that has the following format:
"Id","Sequence"
3,"1,3,13,87,1053,28576,2141733,508147108,402135275365,1073376057490373,9700385489355970183,298434346895322960005291,31479360095907908092817694945,11474377948948020660089085281068730"
7,"1,2,1,5,5,1,11,16,7,1,23,44,30,9,1,47,112,104,48,11,1,95,272,320,200,70,13,1,191,640,912,720,340,96,15,1,383,1472,2464,2352,1400,532,126,17,1,767,3328,6400,7168,5152,2464,784,160,19,1,1535,7424"
8,"1,2,4,5,8,10,16,20,32,40,64,80,128,160,256,320,512,640,1024,1280,2048,2560,4096,5120,8192,10240,16384,20480,32768,40960,65536,81920,131072,163840,262144,327680,524288,655360,1048576,1310720,2097152"
11,"1,8,25,83,274,2275,132224,1060067,3312425,10997342,36304451,301432950,17519415551,140456757358,438889687625,1457125820233,4810267148324,39939263006825,2321287521544174,18610239435360217"
I'd like to read this into a data frame with the type of df['Id'] to be integer-like and the type of df['Sequence'] to be list-like.
I currently have the following kludgy code:
def clean(seq_string):
    return list(map(int, seq_string.split(',')))
# Read data
training_data_file = "data/train.csv"
train = pd.read_csv(training_data_file)
train['Sequence'] = list(map(clean, train['Sequence'].values))
This appears to work, but I feel like the same could be achieved natively using pandas and numpy.
Does anyone have a recommendation?
You can specify a converter for the Sequence column:
converters : dict, default None
    Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
train = pd.read_csv(training_data_file, converters={'Sequence': clean})
This also works, except that the Sequence is list of string instead of list of int:
df = pd.read_csv(training_data_file)
df['Sequence'] = df['Sequence'].str.split(',')
To convert each element to int:
df = pd.read_csv(training_data_file)
df['Sequence'] = df['Sequence'].str.split(',').apply(lambda s: list(map(int, s)))
An alternative solution is to use literal_eval from the ast module. literal_eval evaluates the string as a Python literal and gives you back the sequence of integers (note that a comma-separated string like "1,3,13" evaluates to a tuple, which you can wrap in list() if you need an actual list).
from ast import literal_eval

def clean(x):
    return literal_eval(x)

train = pd.read_csv(training_data_file, converters={'Sequence': clean})
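As a quick check of what the converter produces (a sketch, assuming the train.csv shown in the question):
import pandas as pd
from ast import literal_eval

train = pd.read_csv("data/train.csv",
                    converters={'Sequence': lambda s: list(literal_eval(s))})

print(type(train.loc[0, 'Sequence']))   # <class 'list'>
print(train.loc[0, 'Sequence'][:5])     # first few integers of the sequence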
