Input:
source_dataframe = spark.createDataFrame(
    [
        (1, "1", "2020-01-01", 10),
        (1, "2", "2020-01-01", 20),
        (1, "2", "2020-02-01", 30)
    ],
    ("country_code", "cust_id", "day", "value")
)
Config:
input_config = """
[ {
"source":"source_dataframe",
"opearation":"max",
"group":["country_code", "cust_id"]
}
]
"""
import json
config_dict = json.loads(input_config)
print(config_dict)
Read from the config and apply the operation on the input dataframe. Here I have hardcoded the dataframe (source_dataframe) and the operation (max); this works fine:
for each in config_dict:
    result = source_dataframe.groupBy(["country_code", "cust_id"]).agg(max("value"))
    result.show()
However, instead of hardcoding, if I try to read the dataframe from the config dynamically and apply the operation on the input, I run into different errors. This could be because, when read from the config, they come back as strings. How do I convert the string objects so that they work?
Error: 'str' object has no attribute 'groupBy'
result = each['source'].groupBy(["country_code", "cust_id"]).agg(max("value"))
Error: TypeError: 'str' object is not callable
result = source_dataframe.groupBy(["country_code", "cust_id"]).agg(each['operation']("value"))
This part, where I read the groupBy columns dynamically, works fine:
result = source_dataframe.groupBy(each["group"]).agg(max("value"))
I tried looking at other posts, but could not figure out a solution. Can anyone please help?
Maybe you should evaluate the string, which would grant you access to the underlying dataframe.
result = eval(each['source']).groupBy(["country_code", "cust_id"]).agg(max("value"))
I can't verify this, since I got an error from the first part of your code.
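A safer alternative to eval is to resolve both names through explicit lookups. Below is a minimal sketch, assuming you register the available dataframes in a dict yourself (the dataframes dict is my own addition, not part of the question) and that the operation name matches a function in pyspark.sql.functions; the aggregated column "value" stays hardcoded, as in the question.

import pyspark.sql.functions as F

# map config strings to real objects instead of eval-ing them
dataframes = {"source_dataframe": source_dataframe}

for each in config_dict:
    df = dataframes[each["source"]]         # look up the dataframe by name
    agg_fn = getattr(F, each["operation"])  # resolve e.g. "max" to F.max
    result = df.groupBy(each["group"]).agg(agg_fn("value"))
    result.show()

This avoids calling eval on config input, and it also makes explicit that the aggregation function has to come from pyspark.sql.functions rather than being the Python builtin max.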
I'm trying to add empty columns to my dataset on Colab, but it gives me this error, while on my local machine it works perfectly fine. Does anybody know a possible solution for this?
My code:
dataframe["Comp"] = ''
dataframe["Negative"] = ''
dataframe["Neutral"] = ''
dataframe["Positive"] = ''
dataframe
Error message:
TypeError: Expected unicode, got pandas._libs.properties.CachedProperty
I ran into a similar issue today:
"Expected unicode, got pandas._libs.properties.CachedProperty"
My dataframe (called df) has a time index. When I added a new column to it and filled it with numpy array data, it raised this error. I tried setting it with df.index or df.index.values; it always raised this error.
Finally, I solved it in 3 steps:
df = df.reset_index()
df['new_column'] = new_column_data # it is np.array format
df = df.set_index('original_index_name')
This question is the same as https://stackoverflow.com/a/67997139/16240186, and there's a simple way to solve it: df = df.asfreq('H') # freq can be 'min', 'D', 'M', 'S', '5min', etc.
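For context, here is a minimal sketch of the reset_index workaround above applied to a time-indexed frame; the index name "date" and the sample data are made up for illustration.

import numpy as np
import pandas as pd

# a small frame with a DatetimeIndex, as described in the answer above
df = pd.DataFrame({"value": [1.0, 2.0, 3.0]},
                  index=pd.date_range("2020-01-01", periods=3, freq="H"))
df.index.name = "date"

new_column_data = np.array([10, 20, 30])
df = df.reset_index()               # move the time index into a regular column
df["new_column"] = new_column_data  # plain assignment now works
df = df.set_index("date")           # restore the original index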
I am trying the code:
`s='{"mail":vip#a.com,"type":"a","r_id":"1312","level":307},{"mail":vipx#a.com,"type":"b","r_id":"1111"}'`
data_raw=re.split(r'[\{\}]',s)
data_raw=data_raw[1::2]
data=pd.DataFrame(data_raw)
data[0]=str(data[0])
data['r_id']=data[0].apply(lambda x:re.search(r'(r_id)',data[0]))
data['level']=data[0].apply(lambda x:re.search(r'(level)',data[0]))
print(data)
I wish I could get this result:
r_id    level
1312    307
1111    NaN
But it shows the error: expected string or bytes-like object
So how can I use re.search in pandas, or how else can I get this result?
My two cents...
import re
pattern = re.compile(r'^.*?id\":\"(\d+)\",\"level\":(\d+).*id\":\"(\d+).*$')
string = r'{"mail":vip#a.com,"type":"a","r_id":"1312","level":307},{"mail":vipx#a.com,"type":"b","r_id":"1111"}'
data = pattern.findall(string)
data
Which returns a list containing one tuple:
[('1312', '307', '1111')]
And you can access items with, for example:
data[0][2]
Regex demo: https://regex101.com/r/Inv4gp/1
The below works for me. The type problem arises because you cannot change the type of all the rows like that; you would need a lambda function for that too.
There is an additional problem: the regex and the exception-case handling won't work like that. I propose a solution for this below, but you might want to consider a different regex if you want this to work for other columns.
I'm fairly new to regex, so there might be a more general-purpose solution for your problem.
import re
import pandas as pd
import numpy as np

s = '{"mail":vip#a.com,"type":"a","r_id":"1312","level":307},{"mail":vipx#a.com,"type":"b","r_id":"1111"}'
data_raw = re.split(r'[\{\}]', s)
data_raw = data_raw[1::2]
data = pd.DataFrame(data_raw)
# This is a regex wrapper which takes a row of our pandas dataframe and the column that we want.
def regex_wrapper(row, column):
    match = re.search(r'"' + column + '":"?(\d+)"?', str(row))
    if match:
        return match.group(1)
    else:
        return np.nan
data['r_id'] = data[0].apply(lambda row: regex_wrapper(row,"r_id"))
data['level'] = data[0].apply(lambda row: regex_wrapper(row,"level"))
del data[0]
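For reference, with the sample string above, print(data) then shows the following (the captured values are strings, since they come from regex groups):

   r_id level
0  1312   307
1  1111   NaN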
I am trying to create a dataframe with Python, which works fine with the following command:
df_test2 = DataFrame(index = idx, data=(["-54350","2016-06-25T10:29:57.340Z","2016-06-25T10:29:57.340Z"]))
but when I try to get the data from a variable instead of hard-coding it into the data argument, e.g.:
r6 = ["-54350", "2016-06-25T10:29:57.340Z", "2016-06-25T10:29:57.340Z"]
df_test2 = DataFrame(index = idx, data=(r6))
I expected this to behave the same and work, but I get:
ValueError: DataFrame constructor not properly called!
Reason for the error:
It seems r6 holds a string representation of a list, which isn't enough for the DataFrame constructor.
Fix/Solution:
import ast
# convert the string representation back to a real list
r6_list = ast.literal_eval(r6)
# and use it as the input
df_test2 = DataFrame(index=idx, data=r6_list)
which will solve the error.
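For illustration, here is a minimal reproduction of both the failure and the fix, assuming r6 arrived as a string (e.g. read from a file or an API response); the idx value is made up:

from pandas import DataFrame
import ast

idx = [0, 1, 2]
r6 = '["-54350", "2016-06-25T10:29:57.340Z", "2016-06-25T10:29:57.340Z"]'

# DataFrame(index=idx, data=r6) raises:
# ValueError: DataFrame constructor not properly called!
r6_list = ast.literal_eval(r6)  # parse the string back into a real list
df_test2 = DataFrame(index=idx, data=r6_list)
print(df_test2)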
I'm working with a csv file that has the following format:
"Id","Sequence"
3,"1,3,13,87,1053,28576,2141733,508147108,402135275365,1073376057490373,9700385489355970183,298434346895322960005291,31479360095907908092817694945,11474377948948020660089085281068730"
7,"1,2,1,5,5,1,11,16,7,1,23,44,30,9,1,47,112,104,48,11,1,95,272,320,200,70,13,1,191,640,912,720,340,96,15,1,383,1472,2464,2352,1400,532,126,17,1,767,3328,6400,7168,5152,2464,784,160,19,1,1535,7424"
8,"1,2,4,5,8,10,16,20,32,40,64,80,128,160,256,320,512,640,1024,1280,2048,2560,4096,5120,8192,10240,16384,20480,32768,40960,65536,81920,131072,163840,262144,327680,524288,655360,1048576,1310720,2097152"
11,"1,8,25,83,274,2275,132224,1060067,3312425,10997342,36304451,301432950,17519415551,140456757358,438889687625,1457125820233,4810267148324,39939263006825,2321287521544174,18610239435360217"
I'd like to read this into a data frame with the type of df['Id'] to be integer-like and the type of df['Sequence'] to be list-like.
I currently have the following kludgy code:
def clean(seq_string):
    return list(map(int, seq_string.split(',')))
# Read data
training_data_file = "data/train.csv"
train = pd.read_csv(training_data_file)
train['Sequence'] = list(map(clean, train['Sequence'].values))
This appears to work, but I feel like the same could be achieved natively using pandas and numpy.
Does anyone have a recommendation?
You can specify a converter for the Sequence column. From the read_csv documentation:

converters : dict, default None
    Dict of functions for converting values in certain columns. Keys can either be integers or column labels.

train = pd.read_csv(training_data_file, converters={'Sequence': clean})
This also works, except that Sequence ends up as a list of strings instead of a list of ints:
df = pd.read_csv(training_data_file)
df['Sequence'] = df['Sequence'].str.split(',')
To convert each element to int:
df = pd.read_csv(training_data_file)
df['Sequence'] = df['Sequence'].str.split(',').apply(lambda s: list(map(int, s)))
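To confirm the conversion, a quick check (the expected values are taken from the sample CSV above):

print(type(df['Sequence'].iloc[0]))  # <class 'list'>
print(df['Sequence'].iloc[0][:3])    # [1, 3, 13]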
An alternative solution is to use literal_eval from the ast module. literal_eval evaluates the string as a Python literal, so a comma-separated run of numbers comes back as a tuple of ints (wrap it in list() if you specifically need a list).
from ast import literal_eval
def clean(x):
    return literal_eval(x)
train = pd.read_csv(training_data_file, converters={'Sequence': clean})
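Since clean here is just a thin wrapper, you could equally pass literal_eval directly as the converter:

from ast import literal_eval
import pandas as pd

train = pd.read_csv(training_data_file, converters={'Sequence': literal_eval})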