Error upon converting a pandas dataframe to spark DataFrame - python

I created a pandas DataFrame from some StackOverflow posts and used lxml.etree to separate the code blocks from the text blocks. The code below shows the basic outline:
import lxml.etree
import nltk
import pandas as pd

# unescape the HTML entities so lxml can parse the markup
a1 = tokensentRDD.map(lambda (a, b): (a, ''.join(map(str, b))))
a2 = a1.map(lambda (a, b): (a, b.replace("&lt;", "<")))
a3 = a2.map(lambda (a, b): (a, b.replace("&gt;", ">")))

def parsefunc(x):
    html = lxml.etree.HTML(x)
    code_block = html.xpath('//code/text()')
    text_block = html.xpath('//p/text()')  # paragraph text
    a4 = code_block
    a5 = len(code_block)
    a6 = text_block
    a7 = len(text_block)
    a8 = ''.join(map(str, text_block)).split(' ')
    a9 = len(a8)
    a10 = nltk.word_tokenize(''.join(map(str, text_block)))
    numOfI = 0
    numOfQue = 0
    numOfExclam = 0
    for x in a10:
        if x == 'I':
            numOfI += 1
        elif x == '?':
            numOfQue += 1
        elif x == '!':
            numOfExclam += 1
    return (a4, a5, a6, a7, a9, numOfI, numOfQue, numOfExclam)

a11 = a3.take(6)
a12 = map(lambda (a, b): (a, parsefunc(b)), a11)

columns = ['code_block', 'len_code', 'text_block', 'len_text', 'words#text_block', 'numOfI', 'numOfQ', 'numOfExclam']
index = map(lambda x: x[0], a12)
data = map(lambda x: x[1], a12)

df = pd.DataFrame(data=data, columns=columns, index=index)
df.index.name = 'Id'
df
code_block len_code text_block len_text words#text_block numOfI numOfQ numOfExclam
Id
4 [decimal 3 [I want to use a track-bar to change a form's ... 18 72 5 1 0
6 [div, ] 5 [I have an absolutely positioned , div, conta... 22 96 4 4 0
9 [DateTime] 1 [Given a , DateTime, representing a person's ... 4 21 2 2 0
11 [DateTime] 1 [Given a specific , DateTime, value, how do I... 12 24 2 1 0
I need to create a Spark DataFrame in order to apply machine learning algorithms to the output. I tried:
sqlContext.createDataFrame(df).show()
The error I receive is:
TypeError: not supported type: <class 'lxml.etree._ElementStringResult'>
Can someone tell me a proper way to convert a pandas DataFrame into a Spark DataFrame?

Your problem is not related to pandas. Both code_block (a4) and text_block (a6) contain lxml-specific objects which cannot be encoded using Spark SQL types. Converting these to plain strings should be enough:
a4 = [str(x) for x in code_block]
a6 = [str(x) for x in text_block]
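As a sketch of where the conversion slots in (assuming the same XPath expressions as above), cast the results inside parsefunc before anything else touches them; with plain str values in every column, sqlContext.createDataFrame(df) succeeds:
def parsefunc(x):
    html = lxml.etree.HTML(x)
    # cast lxml's _ElementStringResult items to plain str up front,
    # so every value reaching the pandas DataFrame is Spark-SQL-encodable
    code_block = [str(s) for s in html.xpath('//code/text()')]
    text_block = [str(s) for s in html.xpath('//p/text()')]
    # ... rest of the function unchanged ...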

Related

Pandas compare next row and merge based on conditions

I have the DataFrame below, where Start + Time = End.
I want to check whether the End of the current row equals the Start of the next row and, if so, merge those two rows, provided the ID is the same.
So the output should look like the one shown below.
Sample DF
Start Time End ID
0 43500 60 43560 23
1 43560 60 43620 23
2 43620 1020 44640 24
3 44640 260 44900 24
4 44900 2100 47000 24
Code:
a = df["ID"].tolist()
arr = []
t = True
for i in sorted(list(set(a))):
j = 1
k = 0
temp = {}
tempdf = df[df["ID"] == i]
temp["Start"] = tempdf.iloc[k]["Start"]
temp["Time"] = tempdf.iloc[k]["Time"]
temp["End"] = tempdf.iloc[k]["End"]
temp["ID"] = tempdf.iloc[k]["ID"]
while j < len(tempdf):
if temp["End"] == tempdf.iloc[j]["Start"]:
temp["End"] = tempdf.iloc[j]["End"]
temp["Time"] += tempdf.iloc[j]["Time"]
j += 1
arr.append(temp)
df = pd.DataFrame(arr)
Output DF:
Start Time End ID
0 43500 120 43620 23
1 43620 3380 47000 24
I'm not sure exactly how your data is formatted, but if you hold the rows as plain Python lists of [Start, Time, End, ID] you can merge adjacent rows in place, along these lines:
i = 0
while i < len(data) - 1:
    # merge when the current End meets the next Start and the IDs match
    if data[i][2] == data[i + 1][0] and data[i][3] == data[i + 1][3]:
        data[i][2] = data[i + 1][2]    # extend End
        data[i][1] += data[i + 1][1]   # accumulate Time
        data.pop(i + 1)                # drop the merged row
    else:
        i += 1
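For a more pandas-idiomatic sketch of the same merge (assuming df has the Start, Time, End, ID columns shown above): start a new group whenever the previous End does not chain into the current Start or the ID changes, then aggregate each group.
# a run continues while the previous End equals the current Start within one ID
new_group = (df["Start"] != df["End"].shift()) | (df["ID"] != df["ID"].shift())
merged = (df.groupby(new_group.cumsum())
            .agg({"Start": "first", "Time": "sum", "End": "last", "ID": "first"})
            .reset_index(drop=True))
This reproduces the Output DF above on the sample data.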

How to merge dataframes and return the merged df from a function?

I am reading data from Yahoo through a function. I have a list of stocks and a start and end time defined. Using the code below I iterate through the list of stocks and store each stock's data separately under the name 'X' + StockName. I think this part is working fine.
But then I want to generate a DataFrame that merges all of these single-stock frames and return it as the result of the function. I am stuck there badly.
Can you please help me out?
secs = ['UNH','XOM','HD','DIS','GE','USB','ORCL','KO','PEP','MMM']
DataCollector = ""

def DataCollection(secList, startTime, endTime):
    newList = []
    for i in range(len(secList)):
        DataCollector = 'X' + str(secList[i])
        print(DataCollector)
        newList.append(DataCollector)
        print(newList)
        DataCollector = pd.DataFrame(pdr.get_data_yahoo(secList[i], start = start, end = end)['Adj Close'])
    data = pd.concat(pd.Series(newList))
I have tried many ways and this is the last error I got for the code above.
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "Series"
You need to append the DataFrame itself (not the label string) and replace the last line of your function with a return:
def DataCollection(secList, startTime, endTime):
    newList = []
    for i in range(len(secList)):
        DataCollector = pd.DataFrame(pdr.get_data_yahoo(secList[i], start=startTime, end=endTime)['Adj Close'])
        newList.append(DataCollector)
    return pd.concat(newList).reset_index()
Example
import pandas as pd
import numpy as np

def DataCollection():
    newList = []
    for _ in range(2):
        df = pd.DataFrame({"X": np.random.randint(0, 10, size=4), "Y": list("abcd")})
        print(df)
        print("*" * 10)
        newList.append(df)
    return pd.concat(newList).reset_index()

print(DataCollection())
Output
X Y
0 9 a
1 4 b
2 7 c
3 6 d
**********
X Y
0 5 a
1 0 b
2 0 c
3 6 d
**********
index X Y
0 0 9 a
1 1 4 b
2 2 7 c
3 3 6 d
4 0 5 a
5 1 0 b
6 2 0 c
7 3 6 d
The code below solves the problem: append to newList only after DataCollector has been assigned the DataFrame.
def DataCollection(secList, startTime, endTime):
    newList = []
    for i in range(len(secList)):
        DataCollector = 'X' + str(secList[i])
        print(DataCollector)
        DataCollector = pd.DataFrame(pdr.get_data_yahoo(secList[i], start=startTime, end=endTime)['Adj Close'])
        newList.append(DataCollector)
    return pd.concat(newList)
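A hedged usage sketch (the date strings here are hypothetical); pd.concat's keys= argument can tag each block with its ticker, recovering the labeling the 'X' + name strings were aiming at:
prices = DataCollection(secs, '2020-01-01', '2020-12-31')

# inside the function, an alternative return that labels each block:
#     return pd.concat(newList, keys=secList, names=['ticker', 'Date'])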

I want to make a1=0, a2=0.. aN=0 [duplicate]

This question already has answers here:
How do I create variable variables?
(17 answers)
Closed 4 years ago.
I want to make a1 = 0, a2 = 0, ..., aN = 0.
I thought of using a for loop; for example, with N = 10:
for i in range(0, 10):
    print('a%d' % i)
but that only prints the names, it doesn't set anything to zero. So I tried 'a%d' % i = 0, but that didn't work either.
How can I do that?
For printing, use .format() (or f-strings on Python 3.6+):
for i in range(0, 10):
    print('a{} = {}'.format(i, i))  # the 1st i is put into the 1st {}, the 2nd i into the 2nd
If you want to calculate with those names, store them in a dictionary and use the values in your calculations:
d = {}
for i in range(0, 10):
    d["a{}".format(i)] = i                               # the nth i is stored under the nth name

print("sum a4 to a7: {} + {} + {} + {} = {}".format(    # use the values stored in the dict
    d["a4"], d["a5"], d["a6"], d["a7"],                  # to print the single values
    d["a4"] + d["a5"] + d["a6"] + d["a7"]))              # and the calculated sum
Output:
# for loop
a0 = 0
a1 = 1
a2 = 2
a3 = 3
a4 = 4
a5 = 5
a6 = 6
a7 = 7
a8 = 8
a9 = 9
# calculation
sum a4 to a7: 4 + 5 + 6 + 7 = 22
You can use a dictionary for that.
new_values = {}
var_name = 'a'
for i in range(0, 10):
    key = var_name + str(i)   # builds 'a0', 'a1', ..., 'a9'
    new_values[key] = 0       # assign 0 to the new name
For accessing them individually,
new_values['a1']
>>> 0
or you can access them all together like this,
for k, v in new_values.items():
    print(k, '=', v)
outputs:
a0 = 0
a1 = 0
a2 = 0
a3 = 0
a4 = 0
a5 = 0
a6 = 0
a7 = 0
a8 = 0
a9 = 0
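The same dictionary can be built in one line with a dict comprehension:
new_values = {'a' + str(i): 0 for i in range(10)}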
A simple solution, using a constant value x = 0 and a counter i:
x = 0
for i in range(0, 10):
    print(f"a{i} = {x}")
output:
a0 = 0
a1 = 0
a2 = 0
a3 = 0
a4 = 0
a5 = 0
a6 = 0
a7 = 0
a8 = 0
a9 = 0

Pandas track consecutive near numbers via compare-cumsum-groupby pattern

I am trying to extend my current pattern to also catch values within ± a percentage of the previous value, rather than only strict matches against the previous value.
data = np.array([[2,30],[2,900],[2,30],[2,30],[2,30],[2,1560],[2,30],
                 [2,300],[2,30],[2,450]])
df = pd.DataFrame(data)
df.columns = ['id', 'interval']
UPDATE 2 (id fix): updated data2 with more rows:
data2 = np.array([[2,30],[2,900],[2,30],[2,29],[2,31],[2,30],[2,29],[2,31],[2,1560],[2,30],[2,300],[2,30],[2,450], [3,40],[3,900],[3,40],[3,39],[3,41], [3,40],[3,39],[3,41] ,[3,1560],[3,40],[3,300],[3,40],[3,450]])
df2 = pd.DataFrame(data2)
df2.columns = ['id','interval']
for i, g in df.groupby([(df.interval != df.interval.shift()).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
results in [30, 30, 30]. However, I really want to catch near-number conditions, say when a number is within ±10% of the previous number, so looking at df2 I would like to pick up the series [30, 29, 31]:
for i, g in df2.groupby([(df2.interval != <??? ± 10% magic ???>).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
UPDATE: Here is the downstream processing code where I store the gathered lists in a dictionary keyed by ID:
leak_intervals = {}
final_leak_intervals = {}
serials = []
for i, g in df.groupby([(df.interval != df.interval.shift()).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
        serial = g.id.values[0]
        if serial not in serials:
            serials.append(serial)
        if serial not in leak_intervals:
            leak_intervals[serial] = g.interval.tolist()
        else:
            leak_intervals[serial] = leak_intervals[serial] + g.interval.tolist()
UPDATE:
In [116]: df2.groupby(df2.interval.pct_change().abs().gt(0.1).cumsum()) \
.filter(lambda x: len(x) >= 3)
Out[116]:
id interval
2 2 30
3 2 29
4 2 31
5 2 30
6 2 29
7 2 31
15 3 40
16 3 39
17 3 41
18 3 40
19 3 39
20 3 41
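One hedged refinement on top of that answer, assuming runs should also break at id boundaries (the results are stored per ID above): fold the id change into the grouping key.
# break a run when the interval jumps by more than 10% OR the id changes
key = (df2.interval.pct_change().abs().gt(0.1) | (df2.id != df2.id.shift())).cumsum()
runs = df2.groupby(key).filter(lambda g: len(g) >= 3)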

Best way to create a dataframe from several lists

I am working through ThinkStats, but decided to learn pandas along the way as well. The code below reads data from a file, does some checking, and then appends the data to a list. I end up with several lists containing the data I need. The code below works (except for scrambling up the column order...).
My question is: what is the best way to build a DataFrame from these lists? More generally, am I accomplishing my goal in the most efficient manner?
import numpy as np
import pandas as pd

preglength = []
caseid = []
outcome = []
birthorder = []
finalweight = []

with open('2002FemPreg.dat') as f:
    for line in f:
        caseid.append(int(line[0:13].strip()))
        preglength.append(int(line[274:276].strip()))
        outcome.append(int(line[276].strip()))
        try:
            birthorder.append(int(line[277:279]))
        except ValueError:
            birthorder.append(np.nan)
        finalweight.append(float(line[422:440].strip()))

c1 = pd.Series(caseid)
c2 = pd.Series(preglength)
c3 = pd.Series(outcome)
c4 = pd.Series(birthorder)
c5 = pd.Series(finalweight)

data = pd.DataFrame({'caseid': c1, 'preglength': c2, 'outcome': c3, 'birthorder': c4, 'weight': c5})
print(data.head())
I would probably use read_fwf:
>>> df = pd.read_fwf("2002FemPreg.dat",
... colspecs=[(0,13), (274, 276), (276, 277), (277, 279), (422, 440)],
... names=["caseid", "preglength", "outcome", "birthorder", "finalweight"])
>>> df.head()
caseid preglength outcome birthorder finalweight
0 1 39 1 1 6448.271112
1 1 39 1 2 6448.271112
2 2 39 1 1 12999.542264
3 2 39 1 2 12999.542264
4 2 39 1 3 12999.542264
[5 rows x 5 columns]
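If you keep the list-building approach instead, note that pd.DataFrame accepts a dict of plain lists directly (the intermediate Series are unnecessary), and passing columns= pins the column order, which fixes the scrambling mentioned in the question:
data = pd.DataFrame(
    {'caseid': caseid, 'preglength': preglength, 'outcome': outcome,
     'birthorder': birthorder, 'weight': finalweight},
    columns=['caseid', 'preglength', 'outcome', 'birthorder', 'weight'])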
