Parsing specific columns of CSV in python

So I have this CSV and I would like to do the following:
Original data:
Name Place Item
N1   P1    I1
N2   P2    I1,I3,I4
N3   P2,P3 I2,I5
Parsed Data:
Name Place Item
N1   P1    I1
N2   P2    I1
N2   P2    I3
N2   P2    I4
N3   P2    I2
N3   P2    I5
N3   P3    I2
N3   P3    I5
So, to put it in words: if a column holds comma-separated values, I want to create new rows with only one value each and delete the row that held the multiple values.
For example: N2 has I1, I3 and I4, so the parsed data gets 3 rows for N2, each containing one item.
I want to make it dynamic so that all the combinations are reflected, as in the case of N3, which has 2 places and 2 items (giving 4 rows).
I am trying to use python's pandas to do this. Some help would be appreciated.
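For newer pandas (0.25 or later), the built-in DataFrame.explode covers this directly; a minimal sketch, assuming the Original data shown above:
import pandas as pd

df = pd.DataFrame({'Name': ['N1', 'N2', 'N3'],
                   'Place': ['P1', 'P2', 'P2,P3'],
                   'Item': ['I1', 'I1,I3,I4', 'I2,I5']})
# split the comma-joined fields into lists, then explode one column at a time;
# exploding Place and then Item yields the per-row Cartesian product
out = (df.assign(Place=df['Place'].str.split(','),
                 Item=df['Item'].str.split(','))
         .explode('Place')
         .explode('Item')
         .reset_index(drop=True))
print(out)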

Here is another option:
df['Place'] = df['Place'].str.split(',')
df['Item'] = df['Item'].str.split(',')
exploded = pd.DataFrame([
    a + [p, t] for *a, P, T in df.values
    for p in P for t in T
], columns=df.columns)
And the output:
Name Place Item
0 N1 P1 I1
1 N2 P2 I1
2 N2 P2 I3
3 N2 P2 I4
4 N3 P2 I2
5 N3 P2 I5
6 N3 P3 I2
7 N3 P3 I5

You are effectively attempting to take the Cartesian product of each row, then bind the result back into a DataFrame. As such, you could use itertools and do something like:
from itertools import chain, product

df_lists = df.applymap(lambda s: s.split(','))
pd.DataFrame(chain.from_iterable(df_lists.apply(lambda row: product(*row), axis=1)),
             columns=df.columns)
With your example input:
In [334]: df
Out[334]:
Name Place Item
0 N1 P1 I1
1 N2 P2 I1,I3,I4
2 N3 P2,P3 I2,I5
In [336]: df_lists = df.applymap(lambda s: s.split(','))
In [337]: pd.DataFrame(chain.from_iterable(df_lists.apply(lambda row: product(*row), axis=1)), columns=df.columns)
Out[337]:
Name Place Item
0 N1 P1 I1
1 N2 P2 I1
2 N2 P2 I3
3 N2 P2 I4
4 N3 P2 I2
5 N3 P2 I5
6 N3 P3 I2
7 N3 P3 I5

You can use iterrows():
df = pd.DataFrame({'Name': ['N1', 'N2', 'N3'],
                   'Place': ['P1', 'P2', 'P2,P3'],
                   'Item': ['I1,', 'I1,I3,I4', 'I2,I5']})
result = pd.DataFrame()
new_result = pd.DataFrame()

# strip stray trailing commas before splitting
df['Place'] = df['Place'].apply(lambda x: x.strip(','))
df['Item'] = df['Item'].apply(lambda x: x.strip(','))

# first pass: one row per Place value
# (note: DataFrame.append was removed in pandas 2.0; use pd.concat there)
for _, row in df.iterrows():
    for x in row['Place'].split(','):
        curr_row = row.copy()
        curr_row['Place'] = x
        result = result.append(curr_row, ignore_index=True)

# second pass: one row per Item value
for _, row in result.iterrows():
    for x in row['Item'].split(','):
        curr_row = row.copy()
        curr_row['Item'] = x
        new_result = new_result.append(curr_row, ignore_index=True)
Output:
Name Place Item
0 N1 P1 I1
1 N2 P2 I1
2 N2 P2 I3
3 N2 P2 I4
4 N3 P2 I2
5 N3 P2 I5
6 N3 P3 I2
7 N3 P3 I5
This is a straightforward way to achieve your desired output.

You can avoid the use of pandas. If you want to stick with the standard csv module, you simply split each field on a comma (',') and then iterate over the split elements.
Assuming the input delimiter is a semicolon (;), since it cannot be a comma (the fields themselves contain commas), the code could be:
import csv

with open('input.csv', newline='') as fd, open('output.csv', 'w', newline='') as fdout:
    rd = csv.DictReader(fd, delimiter=';')
    wr = csv.writer(fdout)
    _ = wr.writerow(rd.fieldnames)
    for row in rd:
        for i in row['Item'].split(','):
            i = i.strip()
            if len(i) != 0:
                for p in row['Place'].split(','):
                    p = p.strip()
                    if len(p) != 0:
                        for n in row['Name'].split(','):
                            n = n.strip()
                            if len(n) != 0:
                                wr.writerow((n, p, i))
Output is:
Name,Place,Item
N1,P1,I1
N2,P2,I1
N2,P2,I3
N2,P2,I4
N3,P2,I2
N3,P3,I2
N3,P2,I5
N3,P3,I5
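For what it's worth, the three nested loops above collapse into a single itertools.product call; a sketch reusing the rd and wr objects from the snippet above (the row order of the output differs from the nested-loop version):
from itertools import product

for row in rd:
    fields = (row['Name'].split(','), row['Place'].split(','), row['Item'].split(','))
    # strip whitespace and drop empty fragments left by trailing commas
    cleaned = [[v.strip() for v in vals if v.strip()] for vals in fields]
    for n, p, i in product(*cleaned):
        wr.writerow((n, p, i))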

Related

Regular expression to match the linear system ax+by=c

I'm looking for the best regular expression to match a linear system with 2 unknowns (ax+by=c) for the Python module re, where a, b and c are positive or negative integers. I need to separate the match into 3 groups, each containing one value (with its sign): group 1 containing the value of a, group 2 containing the value of b, and group 3 containing the value of c.
e.g.:
for -3x+y=-2, group 1 will contain -3, group 2 will contain 1 and group 3 will contain -2
e.g.:
x+3y=-4
-2x+y=2
3x-y=2
...
What I used so far is :
r"(^[+-]?\d*)x([+-]?\d*)y=([+-]?\d*)"
It almost works fine, except when a or b has no digits and only a negative sign is present.
e.g.:
-x+2y=4
5x-y=3
I have to put a 1 before x or y if they're negative to make it work:
-x+2y=4 => -1x+2y=4
5x-y=3 => 5x-1y=3
Python code:
import numpy as np
import re

def solve(eq1, eq2):
    match1 = re.match(r"(^[+-]?\d*)x([+-]?\d*)y=([+-]?\d*)", eq1)
    a1, b1, c1 = match1.groups()
    if a1 is None or a1 == '':
        a1 = 1
    elif a1 == '-':
        a1 = -1
    if b1 is None:
        b1 = 1
    elif b1 == '-':
        b1 = -1
    elif b1 == '+':
        b1 = 1
    a1, b1, c1 = float(a1), float(b1), float(c1)
    match2 = re.match(r"([+-]?\d*)x([+-]?\d*)y=([+-]?\d*)", eq2)
    a2, b2, c2 = match2.groups()
    if a2 is None or a2 == '':
        a2 = 1
    elif a2 == '-':
        a2 = -1
    if b2 is None:
        b2 = 1
    elif b2 == '-':
        b2 = -1
    elif b2 == '+':
        b2 = 1
    a2, b2, c2 = float(a2), float(b2), float(c2)
    A = np.array([[a1, b1], [a2, b2]])
    B = np.array([[c1], [c2]])
    print(np.linalg.inv(A) @ B)

solve("x-y=7", "2x+3y=4")
Output:
[[ 5.][-2.]]
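One way to handle the bare signs without pre-editing the equations is to capture the sign and the digits separately and default missing digits to 1. A sketch (the coeff helper is hypothetical, not part of the original code):
import re

def coeff(sign, digits):
    # '' or '+' with no digits -> 1.0, '-' with no digits -> -1.0
    return float(sign + (digits or '1'))

m = re.match(r"^([+-]?)(\d*)x([+-]?)(\d*)y=([+-]?\d+)$", "-x+2y=4")
s1, d1, s2, d2, c = m.groups()
print(coeff(s1, d1), coeff(s2, d2), float(c))  # -1.0 2.0 4.0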
Split based on the regular expression x|y=, taking care of empty strings and bare + or - signs without numbers.
import re

ee = ['x+3y=-4', '-2x+y=2', '3x-y=2', '-x+2y=4', '5x-y=3']
for e in ee:
    print([int(m + '1' if m in ['', '+', '-'] else m)
           for m in re.split('x|y=', e)])
Output:
[1, 3, -4]
[-2, 1, 2]
[3, -1, 2]
[-1, 2, 4]
[5, -1, 3]
Update #1:
import numpy as np
import re

def solve(eq1, eq2):
    coeffs = []
    for e in [eq1, eq2]:
        for m in re.split('x|y=', e):
            coeffs.append(float(m + '1' if m in '+-' else m))
    a1, b1, c1, a2, b2, c2 = coeffs
    A = np.array([[a1, b1], [a2, b2]])
    B = np.array([[c1], [c2]])
    return np.linalg.inv(A) @ B

print(solve("x-y=7", "2x+3y=4"))
Output:
[[ 5.]
[-2.]]
Check it online with rextester.
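As a side note, numpy.linalg.solve is usually preferred over inv(A) @ B for solving a linear system (it is numerically more stable). A quick check with the same example:
import numpy as np

A = np.array([[1.0, -1.0], [2.0, 3.0]])
B = np.array([[7.0], [4.0]])
print(np.linalg.solve(A, B))  # [[ 5.] [-2.]]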

Pandas row index optimized for a particular column

I have an example dataframe as follows
p1 p2 p3 score
0 1 a t1 0.408718
1 1 a t2 0.694732
2 1 a t3 0.001077
3 1 b t1 0.250646
4 1 b t2 0.877506
5 1 b t3 0.033305
6 2 a t1 0.735524
7 2 a t2 0.055166
8 2 a t3 0.579875
9 2 b t1 0.579199
10 2 b t2 0.785301
11 2 b t3 0.339372
p1, p2 and p3 are parameters. What I would like to do is select the optimal (p1, p2) pair, that is, the one with the maximum average score taken across its p3 values.
For example, in the given dataframe this function should return rows 9, 10 and 11, since the mean of their scores (0.579199, 0.785301, 0.339372) = 0.567957 is the maximum I can get for any given (p1, p2) pair.
My try so far (using pandas groupby) is as follows:
temp = []
for eachgroup in df.groupby(['p1', 'p2']).groups.keys():
    temp.append(df.groupby(['p1', 'p2']).get_group(eachgroup)['score'])
temp1 = []
for each in temp:
    temp1.append(each.mean())
maxidx = temp1.index(max(temp1))
temp[maxidx].index
Returns me the following output
Int64Index([9, 10, 11], dtype='int64')
However, this is very inefficient and works only for smaller dataframes. How can I do the same for bigger dataframes?
In your case
s=df.groupby(['p1','p2']).score.transform('mean')
s.index[s==s.max()]
Out[239]: Int64Index([9, 10, 11], dtype='int64')
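To get the rows themselves rather than just their labels, the same mask can be passed to df.loc (assuming the s defined above):
df.loc[s.index[s == s.max()]]  # rows 9, 10 and 11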
Using groupby and transform:
>>> df.groupby(['p1', 'p2']).score.transform('mean').idxmax()
9
If instead you want the combination of p1 and p2 that corresponds with this maximum:
>>> df.groupby(['p1', 'p2']).score.mean().idxmax()
(2, 'b')
The latter would be helpful if you wanted to view the range that created the maximum average:
df.set_index(['p1', 'p2']).loc[(2, 'b')]
p3 score
p1 p2
2 b t1 0.579199
b t2 0.785301
b t3 0.339372
One-liner: group by p1 and p2 and take the mean of the score column for each group, then get the index label of the maximum value in the aggregated series.
>>> df.groupby(['p1', 'p2'])['score'].agg(lambda x: x.mean()).idxmax()
(2, 'b')
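Putting the pieces together, a minimal end-to-end sketch using the example values from the question:
import pandas as pd

# rebuild the example frame (values copied from the question)
df = pd.DataFrame({
    'p1': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'p2': ['a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a', 'b', 'b', 'b'],
    'p3': ['t1', 't2', 't3'] * 4,
    'score': [0.408718, 0.694732, 0.001077, 0.250646, 0.877506, 0.033305,
              0.735524, 0.055166, 0.579875, 0.579199, 0.785301, 0.339372],
})

# best (p1, p2) pair by mean score, then the rows belonging to it
best = df.groupby(['p1', 'p2'])['score'].mean().idxmax()   # (2, 'b')
print(df.set_index(['p1', 'p2']).loc[[best]])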

Replicating rows in a pandas data frame

I have the following DataFrame:
N numbers
n1 1,2,3
n2 4,6,2
n4 2,5
....
frequency=[0.45, 0.5, 0.05]
Activ = [ 1, 2, 3]
df = shuffle(df)[:20]
Activs=np.random.choice(Activ , len(df), p=frequency)
df['index']=pd.Series(Activs.tolist())
df_new = df.loc[np.repeat(df.index.values,df.index)]
I want to get a data frame like this:
df_new:
N numbers index
n1 1,2,3 3
n1 1,2,3 3
n2 4,6,2 2
n2 4,6,2 2
n2 4,6,2 2
n1 1,2,3 1
n4 2,5 2
....
I get an error; in my frame the index column ends up with wrong values and NaN.
I think the index column is not necessary; for np.repeat it is possible to use the array Activs directly:
df = pd.DataFrame({'numbers': ['1,2,3', '4,6,2', '2,5'], 'N': ['n1', 'n2', 'n4']})
print (df)
N numbers
0 n1 1,2,3
1 n2 4,6,2
2 n4 2,5
frequency=[0.45, 0.5, 0.05]
Activ = [ 1, 2, 3]
df = df[:20]
#for testing
np.random.seed(100)
Activs=np.random.choice(Activ , len(df.index), p=frequency)
print (Activs)
[2 1 1]
df_new = df.loc[np.repeat(df.index,Activs)]
print (df_new)
N numbers
0 n1 1,2,3
0 n1 1,2,3
1 n2 4,6,2
2 n4 2,5
But if you need a new column created from Activs, it is better not to use the name index unless really necessary, e.g. use the name val:
np.random.seed(100)
Activs=np.random.choice(Activ , len(df.index), p=frequency)
print (Activs)
[2 1 1]
df['val'] = Activs
df_new = df.loc[np.repeat(df.index,Activs)]
print (df_new)
N numbers val
0 n1 1,2,3 2
0 n1 1,2,3 2
1 n2 4,6,2 1
2 n4 2,5 1
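If the repeated index labels (0, 0, 1, ...) left behind by np.repeat are a problem downstream, an optional reset gives a clean RangeIndex:
# optional: discard the duplicated labels produced by np.repeat
df_new = df_new.reset_index(drop=True)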

EOF error python 3?

I keep getting an EOF error in python 3. Here is my code
num = float(input()) #servings
p = float(input()) #people
a2 = float(input())
b2 = float(input())
c2 = float(input())
d2 = float(input())
e2 = float(input())
f2 = float(input())
g2 = float(input())
h2 = float(input())
i2 = float(input())
a1 = a2 / num
b1 = b2 / num
c1 = c2 / num
d1 = d2 / num
e1 = e2 / num
f1 = f2 / num
g1 = g2 / num
h1 = h2 / num
i1 = i2 / num
a = a1 * p
b = b1 * p
c = c1 * p
d = d1 * p
e = e1 * p
f = f1 * p
g = g1 * p
h = h1 * p
i = i1 * p
lis = str(a)+ str(b)+ str(c)+ str(d)+ str(e)+ str(f)+ str(g)+ str(h)+ str(i)
print (lis) #8 14 1 1 6 2 1 2 .5 2
and the error is on line 11. If I delete line 11 and all code that goes with it, it gives me the error on line 10, then 9, then 8, etc.
The code works fine as long as you give it 11 input values, since there are 11 input() calls. The EOFError occurs when you don't provide enough inputs. I assume the comment on the last line is your input; it has only 10 values, and I think that is the reason for the EOFError.
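As a hedged sketch of one way to fail with a clearer message (it assumes whitespace-separated numbers on stdin, which differs from the strict one-value-per-line reads above):
values = []
while len(values) < 11:
    try:
        # accept one or more whitespace-separated numbers per line
        values.extend(float(tok) for tok in input().split())
    except EOFError:
        raise SystemExit('expected 11 numbers, got %d' % len(values))

num, p, *amounts = values  # servings, people, then the nine quantities
print(' '.join(str(x / num * p) for x in amounts))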

Error upon converting a pandas dataframe to spark DataFrame

I created a pandas dataframe out of some StackOverflow posts and used lxml.etree to separate the code blocks and the text blocks. The code below shows the basic outline:
import lxml.etree

# note: the tuple-unpacking lambdas below (lambda (a, b): ...) are Python 2 only
a1 = tokensentRDD.map(lambda (a, b): (a, ''.join(map(str, b))))
a2 = a1.map(lambda (a, b): (a, b.replace("&lt;", "<")))  # unescape HTML entities
a3 = a2.map(lambda (a, b): (a, b.replace("&gt;", ">")))

def parsefunc(x):
    html = lxml.etree.HTML(x)
    code_block = html.xpath('//code/text()')
    text_block = html.xpath('// /text()')
    a4 = code_block
    a5 = len(code_block)
    a6 = text_block
    a7 = len(text_block)
    a8 = ''.join(map(str, text_block)).split(' ')
    a9 = len(a8)
    a10 = nltk.word_tokenize(''.join(map(str, text_block)))
    numOfI = 0
    numOfQue = 0
    numOfExclam = 0
    for x in a10:
        if x == 'I':
            numOfI += 1
        elif x == '?':
            numOfQue += 1
        elif x == '!':
            numOfExclam += 1
    return (a4, a5, a6, a7, a9, numOfI, numOfQue, numOfExclam)

a11 = a3.take(6)
a12 = map(lambda (a, b): (a, parsefunc(b)), a11)
columns = ['code_block', 'len_code', 'text_block', 'len_text', 'words#text_block', 'numOfI', 'numOfQ', 'numOfExclam']
index = map(lambda x: x[0], a12)
data = map(lambda x: x[1], a12)
df = pd.DataFrame(data=data, columns=columns, index=index)
df.index.name = 'Id'
df
code_block len_code text_block len_text words#text_block numOfI numOfQ numOfExclam
Id
4 [decimal 3 [I want to use a track-bar to change a form's ... 18 72 5 1 0
6 [div, ] 5 [I have an absolutely positioned , div, conta... 22 96 4 4 0
9 [DateTime] 1 [Given a , DateTime, representing a person's ... 4 21 2 2 0
11 [DateTime] 1 [Given a specific , DateTime, value, how do I... 12 24 2 1 0
I need to create a Spark DataFrame on order to apply machine learning algorithms on the output. I tried:
sqlContext.createDataFrame(df).show()
The error I receive is:
TypeError: not supported type: <class 'lxml.etree._ElementStringResult'>
Can someone tell me a proper way to convert a Pandas DataFrame into A Spark DataFrame?
Your problem is not related to Pandas. Both code_block (a4) and text_block (a6) contain lxml specific objects which cannot be encoded using SparkSQL types. Converting these to strings should be just enough.
a4 = [str(x) for x in code_block]
a6 = [str(x) for x in text_block]
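A minimal illustration of the type issue (assumes lxml is installed; the exact class name depends on the lxml and Python versions):
import lxml.etree

html = lxml.etree.HTML("<p>hi <code>x = 1</code></p>")
res = html.xpath('//code/text()')[0]
print(type(res))       # a 'smart string' subclass such as _ElementStringResult
                       # or _ElementUnicodeResult, depending on the version
print(type(str(res)))  # <class 'str'>, which Spark SQL can encode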
