Python SQL Data to Array (like csv) - python

Currently I am trying to read my SQL data into an array in Python.
I am new to Python, so please be kind ;)
My csv-export can be read easily:
data = pd.read_csv('esoc_data.csv', header = None)
x = data[[1,2,3,4,5,6,7,8,9,10,11,12]]
This selects the columns labelled 1 through 12, i.e. everything except the first column (labels start at 0). I need the data in exactly this format!
Now I want to do the same with the data I get from my SQL-fetch.
names = [i for i in cursor.fetchall()]
This gives me my data with all 13 columns (0-12) as a list of tuples.
Result:
[('name#mail.com', 13, 13, 0, 24, 2, 0, 20, 3, 0, 31, 12, 2), (...)]
Now .. how do I get this into my "specific" format I mentioned before?
I just need the numbers like this:
    1   2  3   4   5  6   7   8  9  10  11  12
0  13  13  0  24   2  0  20   3  0  31  12   2
1  21   0  0  24   0  0  32   0  0  30   0   0
2   9   7  0  26  31  0  19  27  0  30  32   2
I'm sorry if this is peanuts for you.

You can use a nested loop for this, something like:
def our_method():
    parent_list = list()
    for name in names:
        child_list = list()
        for index, item in enumerate(name):
            # skip the first column (the e-mail address)
            if index != 0:
                child_list.append(item)
        parent_list.append(child_list)
    return parent_list
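Since pandas is already used for the CSV case, another option is to load the fetched rows into a DataFrame and select the same column labels. This is only a minimal sketch: `rows` is hypothetical sample data standing in for `cursor.fetchall()`, the e-mail addresses are made up, and the numbers are copied from the desired output in the question.

```python
import pandas as pd

# Hypothetical stand-in for cursor.fetchall(): a list of 13-element tuples.
rows = [
    ("name1@example.com", 13, 13, 0, 24, 2, 0, 20, 3, 0, 31, 12, 2),
    ("name2@example.com", 21, 0, 0, 24, 0, 0, 32, 0, 0, 30, 0, 0),
    ("name3@example.com", 9, 7, 0, 26, 31, 0, 19, 27, 0, 30, 32, 2),
]

# Without explicit column names, pandas labels the columns 0..12,
# just as read_csv(..., header=None) does.
data = pd.DataFrame(rows)

# Same selection as in the CSV case: the columns labelled 1 through 12.
x = data[list(range(1, 13))]
print(x)
```

The selection line is then identical for both data sources, so any code that consumes `x` should not need to change.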

Related

How do I split a dataframe by a list of numbers of rows for each chunk?

I have a dataframe which I want to cut up according to the elements in a list. For example, I have range_list = [16, 14, 2, ...]
I then want to cut up the dataframe so that the first chunk will be 16 rows long, the second part 14, the third part 2.. etc. It could be beneficial to put this in a list as well.
Use numpy.split. It takes a list of indices at which to split, so you need the cumulative sum (cumsum) of your range list.
indices = np.cumsum(range_list, dtype=np.int32)
np.split(df, indices)
Example
range_list = [16, 14, 2]
np.random.seed(0)
df = pd.DataFrame(np.random.randn(sum(range_list), 2))
indices = np.cumsum(range_list, dtype=np.int32)
np.split(df, indices)
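Returns a list of DataFrames. In this example the first three have shapes (16, 2), (14, 2) and (2, 2); a fourth, empty DataFrame follows because the last cumulative index equals the length of df. Pass indices[:-1] to avoid the trailing empty chunk.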
[           0          1
0    1.060679   1.092185
1   -0.043971  -1.394001
2    1.106233  -0.711420
3   -0.585148   0.179987
4   -0.871562   0.730840
5    0.810119  -0.130510
6   -0.957646  -0.324547
7    0.235788  -0.460025
8   -0.262714  -0.496833
9    0.454519  -1.244402
10   0.084796   1.587114
11  -0.353880   1.110543
12  -0.570345   0.774158
13   1.772536   1.283950
14  -1.682226  -0.376789
15   0.956894   0.081805,
            0          1
16   0.014841   0.110091
17  -0.408881   0.260970
18   0.004939   0.940186
19  -2.056951   0.353928
20   0.618294  -2.201036
21   1.375224   0.526367
22  -0.424886  -1.253565
23   1.785862   0.774936
24  -0.341340  -1.056191
25  -0.274463  -1.637185
26   1.596336   2.311630
27  -0.479840   1.021640
28  -1.307765  -0.232664
29   0.243427   0.339242,
            0          1
30   0.345476   0.331306
31   0.895437  -1.163441,
Empty DataFrame
Columns: [0, 1]
Index: []]
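As an alternative that avoids the trailing empty chunk, the same cumulative-sum idea can be written with positional slicing. This is a sketch, not part of the answer above: `split_by_sizes` is a hypothetical helper name, and the demo frame uses `np.arange` rather than random data so the result is easy to check.

```python
import numpy as np
import pandas as pd

def split_by_sizes(df, sizes):
    """Split df into consecutive row chunks whose lengths are given by sizes."""
    # Chunk boundaries: 0, then the running total of the sizes.
    bounds = np.cumsum([0] + list(sizes))
    # Pair each boundary with the next one and slice rows by position.
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

range_list = [16, 14, 2]
df = pd.DataFrame(np.arange(2 * sum(range_list)).reshape(-1, 2))
chunks = split_by_sizes(df, range_list)
print([chunk.shape for chunk in chunks])  # [(16, 2), (14, 2), (2, 2)]
```

Using `iloc` keeps the slicing positional even when the DataFrame index is not a plain 0..n-1 range.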
I'm not sure if I understand you correctly.
If you just want to split the list, you can do something like this:
def split_list(l, range_list):
    i = 0
    for x in range_list:
        start = i
        end = start + x
        print(l[start:end])
        i = end  # advance to the start of the next chunk
You can create an array with the cumulative sum of the list elements, prepend a zero, and then iterate over consecutive pairs of boundaries to slice the original dataframe:
ls = [16, 14, 2]  # ...
chunks = np.cumsum(ls)
c = np.zeros(len(chunks) + 1, dtype=int)
c[1:] = chunks
all_dfs = []
for i in range(len(c) - 1):
    all_dfs.append(df.iloc[c[i]:c[i + 1]])

Sort on 2 columns which are inter

I have a dataframe :
start end
1 10
26 50
6 15
1 5
11 25
I expect following dataframe :
start end
1 10
11 25
26 50
1 5
6 15
The sort order is simply that the start of row n+1 must equal the end of row n plus 1. If no such row is found, continue with another row whose start is 1.
Can anyone suggest what combination of sort and group by I can use to convert the above dataframe into the required format?
You could transform the df to a list and then do:
l=[1,10,26,50,6,15,1,5,11,25]
result=[]
for x in range(int(len(l)/2)):
    result.append(sorted([l[2*x],l[2*x+1]])[1])
    result.append(sorted([l[2*x],l[2*x+1]])[0])
This will give you result (the larger value of each pair comes first):
[10, 1, 50, 26, 15, 6, 5, 1, 25, 11]
To transform the original df to a list you can do:
startcollist=df['start'].values.tolist()
endcollist=df['end'].values.tolist()
l=[]
for index, each in enumerate(startcollist):
    l.append(each)
    l.append(endcollist[index])
You can then transform result back to a dataframe:
df=pd.DataFrame({'start':result[1::2], 'end':result[0::2]})
Giving the result:
   end  start
0   10      1
1   50     26
2   15      6
3    5      1
4   25     11
The expression result[1::2] gives every element of result at an odd index, and result[0::2] gives every element at an even index. For an explanation, see here: https://stackoverflow.com/a/12433705/8565438
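The ordering rule stated in the question (each row's start is the previous row's end plus 1, and a new chain begins at a row whose start is 1) can also be followed directly. The sketch below is one possible reading of that rule, not the approach of the answer above; `chain_order` is a hypothetical helper, and it assumes that when several rows could continue a chain, the first one in the original order is taken.

```python
def chain_order(pairs):
    """Order (start, end) pairs so each start is the previous end + 1 where possible."""
    remaining = list(pairs)
    ordered = []
    while remaining:
        # Begin a new chain at the first remaining row whose start is 1;
        # fall back to the first remaining row if there is none.
        current = next((p for p in remaining if p[0] == 1), remaining[0])
        while current is not None:
            remaining.remove(current)
            ordered.append(current)
            wanted_start = current[1] + 1
            # Continue the chain with the first row that starts right after this end.
            current = next((p for p in remaining if p[0] == wanted_start), None)
    return ordered

pairs = [(1, 10), (26, 50), (6, 15), (1, 5), (11, 25)]
print(chain_order(pairs))
# [(1, 10), (11, 25), (26, 50), (1, 5), (6, 15)]
```

With a dataframe, the pairs could be taken from `zip(df['start'], df['end'])` and the ordered list passed back to `pd.DataFrame(ordered, columns=['start', 'end'])`.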

Python: How to create weighted quantiles in Pandas?

I understand how to create simple quantiles in Pandas using pd.qcut. But after searching around, I don't see anything to create weighted quantiles. Specifically, I wish to create a variable which bins the values of a variable of interest (from smallest to largest) such that each bin contains an equal weight. So far this is what I have:
def wtdQuantile(dataframe, var, weight = None, n = 10):
    if weight is None:
        return pd.qcut(dataframe[var], n, labels = False)
    else:
        dataframe.sort_values(var, ascending = True, inplace = True)
        cum_sum = dataframe[weight].cumsum()
        cutoff = max(cum_sum)/n
        quantile = cum_sum/cutoff
        quantile[-1:] -= 1
        return quantile.map(int)
Is there an easier way, or something prebuilt from Pandas that I'm missing?
Edit: As requested, I'm providing some sample data. In the following, I'm trying to bin the "Var" variable using "Weight" as the weight. Using pd.qcut, we get an equal number of observations in each bin. Instead, I want an equal weight in each bin, or in this case, as close to equal as possible.
Weight  Var  pd.qcut(n=5)  Desired_Rslt
    10    1             0             0
    14    2             0             0
    18    3             1             0
    15    4             1             1
    30    5             2             1
    12    6             2             2
    20    7             3             2
    25    8             3             3
    29    9             4             3
    45   10             4             4
I don't think this is built-in to Pandas, but here is a function that does what you want in a few lines:
import numpy as np
import pandas as pd
from pandas.api.types import is_integer
def weighted_qcut(values, weights, q, **kwargs):
    'Return weighted quantile cuts from a given series, values.'
    if is_integer(q):
        quantiles = np.linspace(0, 1, q + 1)
    else:
        quantiles = q
    # cumulative weight, accumulated in order of increasing value
    order = weights.iloc[values.argsort()].cumsum()
    bins = pd.cut(order / order.iloc[-1], quantiles, **kwargs)
    return bins.sort_index()
We can test it on your data this way:
data = pd.DataFrame({
    'var': range(1, 11),
    'weight': [10, 14, 18, 15, 30, 12, 20, 25, 29, 45]
})
data['qcut'] = pd.qcut(data['var'], 5, labels=False)
data['weighted_qcut'] = weighted_qcut(data['var'], data['weight'], 5, labels=False)
print(data)
The output matches your desired result from above:
   var  weight  qcut  weighted_qcut
0    1      10     0              0
1    2      14     0              0
2    3      18     1              0
3    4      15     1              1
4    5      30     2              1
5    6      12     2              2
6    7      20     3              2
7    8      25     3              3
8    9      29     4              3
9   10      45     4              4
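To make the idea behind the function easier to follow, here is a rough pure-Python sketch of the same cumulative-weight binning. `weighted_bins` is a hypothetical name; it assumes positive weights and right-closed bins, which is also the default of `pd.cut`.

```python
import math

def weighted_bins(values, weights, q):
    """Assign each value to one of q bins of roughly equal total weight."""
    # Visit the observations in order of increasing value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    total = sum(weights)
    bins = [0] * len(values)
    running = 0
    for i in order:
        running += weights[i]
        # Fraction of total weight reached so far, mapped onto bins 0..q-1.
        # Bins are right-closed: a fraction of exactly k/q falls into bin k-1.
        bins[i] = min(q - 1, max(0, math.ceil(running / total * q) - 1))
    return bins

var = list(range(1, 11))
weight = [10, 14, 18, 15, 30, 12, 20, 25, 29, 45]
print(weighted_bins(var, weight, 5))  # [0, 0, 0, 1, 1, 2, 2, 3, 3, 4]
```

For the sample data the cumulative weights are 10, 24, 42, 57, 87, 99, 119, 144, 173 and 218, so the bin edges at 20%, 40%, 60% and 80% of 218 produce the desired result column.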

input a non-regular matrix in python

link: https://cw.felk.cvut.cz/courses/a4b33alg/task.php?task=pary_py&idu=2341
I want to input the matrix split by space by using:
def neighbour_pair(l):
    matrix = [[int(row) for row in input().split()] for i in range(l)]
but the program told me
TypeError: 'str' object cannot be interpreted as an integer
It seems the .split() didn't work but I don't know why.
here is an example of the input matrix:
13 5
7 50 0 0 1
2 70 10 11 0
4 30 9 0 0
6 70 0 0 0
1 90 8 12 0
9 90 0 2 1
13 90 0 6 0
5 30 4 3 0
12 80 0 0 1
10 50 0 0 1
11 50 0 0 0
3 80 1 13 0
8 70 7 0 1
The input is a binary tree with N nodes, the nodes are labeled by numbers 1 to N in random order, each label is unique. Each node contains an integer key in the range from 0 to (2^31)−1.
The first line of input contains two integers N and R separated by space. N is the number of nodes in the tree, R is the label of the tree root.
Next, there are N lines. Each line describes one node and the order of the nodes is arbitrary. A node is specified by five integer values. The first value is the node label, the second value is the node key, the third and the fourth values represent the labels of the left and right child respectively, and the fifth value represents the node color, white is 0, black is 1. If any of the children does not exist there is value 0 instead of the child label at the corresponding place. The values on the line are separated by a space.
This is the range() complaining that your l variable is a string:
>>> range('1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' object cannot be interpreted as an integer
I suspect you are also reading l from standard input; if so, cast it to an integer:
l = int(input())
matrix = [[int(row) for row in input().split()] for i in range(l)]
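For the input format described in the question (a first line with N and R, followed by N node lines), the header can be parsed first and N then used as the row count. This is a sketch under that assumption; `read_tree` is a hypothetical helper that takes any iterable of lines, so it can be fed `sys.stdin` or a list of strings.

```python
def read_tree(lines):
    """Parse 'N R' from the first line, then N rows of integers."""
    it = iter(lines)
    # First line: number of nodes and label of the root.
    n, root = map(int, next(it).split())
    # Next n lines: label, key, left child, right child, colour.
    nodes = [[int(token) for token in next(it).split()] for _ in range(n)]
    return n, root, nodes

sample = [
    "3 7",
    "7 50 0 0 1",
    "2 70 10 11 0",
    "4 30 9 0 0",
]
n, root, nodes = read_tree(sample)
print(n, root, nodes)
```

Reading from standard input would then be `read_tree(sys.stdin)` after `import sys`.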
I agree with @alecxe. It seems that your error refers to the string being passed as l to range(l). If I put a static int in the range() call, it works. Entering 3 followed by three rows of input gives me the output below.
>>> l = input() # define the number of rows expected the input matrix
>>> [[int(row) for row in input().split()] for i in range(int(l))]
13 5
7 50 0 0 1
2 70 10 11 0
output
[[13, 5], [7, 50, 0, 0, 1], [2, 70, 10, 11, 0]]
Implemented as a method, per the OP request in the comments below:
def neighbour_pair():
    l = input()
    return [[int(row) for row in input().split()] for i in range(int(l))]
print( neighbour_pair() )
# input
# 3
# 13 5
# 7 50 0 0 1
# 2 70 10 11 0
# output
[[13, 5], [7, 50, 0, 0, 1], [2, 70, 10, 11, 0]]
Still nothing wrong with this implementation...

outputting python/numpy arrays as columns

I'm very new to python, but have been using it to calculate and filter through data. I'm trying to output my array so I can pass it to other programs, but the output is one solid piece of text, with brackets and commas separating it.
I understand there are ways of manipulating this, but I want to understand why my code has output it in this format, and how to make it output it in nice columns instead.
The array was generated with:
#!/usr/bin/env python
import numpy as np
import networkx
import gridData
from scipy.spatial.distance import euclidean
INPUT1=open("test_area.xvg",'r')
INPUT2=open("test_atom.xvg",'r')
OUTPUT1= open("negdist.txt",'w')
area = []
pointneg = []
posneg = []
negdistance =[ ]
negresarea = []
while True:
    line = INPUT1.readline()
    if not line:
        break
    col = line.split()
    if col:
        area.append(((col[0]),float(col[1])))
pointneg.append((-65.097000,5.079000,-9.843000))
while True:
    line = INPUT2.readline()
    if not line:
        break
    col = line.split()
    if col:
        pointneg.append((float(col[5]),float(col[6]),float(col[7])))
        posneg.append((col[4]))
for col in posneg:
    negresarea.append(area[int(col)-1][1])
a=len(pointneg)
for x in xrange(a-1):
    negdistance.append((-1,(negresarea[x]),euclidean((pointneg[0]),(pointneg[x]))))
print >> OUTPUT1, negdistance
example output:
[(-1, 1.22333, 0.0), (-1, 1.24223, 153.4651968428021), (-1, 1.48462, 148.59335545709976), (-1, 1.39778, 86.143305392816202), (-1, 0.932278, 47.914688322058403), (-1, 1.04997, 28.622555546282022),
desired output:
[-1, 1.22333, 0.0
-1, 1.24223, 153.4651968428021
-1, 1.48462, 148.59335545709976
-1, 1.39778, 86.143305392816202
-1, 0.932278, 47.914688322058403
-1, 1.04997, 28.622555546282022...
Example inputs:
example input1
1 2.12371 0
2 1.05275 0
3 0.865794 0
4 0.933986 0
5 1.09092 0
6 1.22333 0
7 1.54639 0
8 1.24223 0
9 1.10928 0
10 1.16232 0
11 0.60942 0
12 1.40117 0
13 1.58521 0
14 1.00011 0
15 1.18881 0
16 1.68442 0
17 0.866275 0
18 1.79196 0
19 1.4375 0
20 1.198 0
21 1.01645 0
22 1.82221 0
23 1.99409 0
24 1.0728 0
25 0.679654 0
26 1.15578 0
27 1.28326 0
28 1.00451 0
29 1.48462 0
30 1.33399 0
31 1.13697 0
32 1.27483 0
33 1.18738 0
34 1.08141 0
35 1.15163 0
36 0.93699 0
37 0.940171 0
38 1.92887 0
39 1.35721 0
40 1.85447 0
41 1.39778 0
42 1.97309 0
Example Input2
ATOM 35 CA GLU 6 56.838 -5.202 -102.459 1.00273.53 C
ATOM 55 CA GLU 8 54.729 -6.650 -96.930 1.00262.73 C
ATOM 225 CA GLU 29 5.407 -2.199 -58.801 1.00238.62 C
ATOM 321 CA GLU 41 -24.633 -0.327 -34.928 1.00321.69 C
The problem is the multiple parentheses when you append: you are appending tuples.
What you want is to append lists, i.e. the ones with square brackets.
import numpy as np
area = []
with open('example2.txt') as filehandle:
    for line in filehandle:
        if line.strip() == '':
            continue
        line = line.strip().split(',')
        area.append([int(line[0]),float(line[1]),float(line[2])])
area = np.array(area)
print(area)
'example2.txt' is the data you provided made into a csv
I didn't really get an answer that enabled me to understand the problem; the one suggested above just prevented the whole code from working properly. I did find a workaround by including the print command in the loop that builds my final output.
for x in xrange(a-1):
    negdistance.append((-1,(negresarea[x]),euclidean((pointneg[0]),(pointneg[x]))))
    print negdistance
    negdistance =[]
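As for why the original output was one solid block: printing a list writes its repr, which is the whole list in brackets on a single line. Writing each tuple on its own line gives columns instead. The sketch below uses Python 3 syntax (the code in the question is Python 2); `write_rows` is a hypothetical helper and the sample rows are shortened values from the example output.

```python
import io

def write_rows(rows, out):
    """Write each row on its own line, with values separated by ', '."""
    for row in rows:
        out.write(", ".join(str(value) for value in row) + "\n")

negdistance = [(-1, 1.22333, 0.0), (-1, 1.24223, 153.5), (-1, 1.48462, 148.25)]

# An in-memory buffer stands in for the open output file here.
buffer = io.StringIO()
write_rows(negdistance, buffer)
print(buffer.getvalue())
```

Passing an open file object (for example the result of `open("negdist.txt", "w")`) instead of the buffer writes the same lines to disk.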
