I have a printed output:
{-1: [2, 10, 11, 13, 16, 19, 24, 28, 30, 32, 34, 35, 36, 40, 42, 49, 54, 56, 59, 64, 66, 78, 94, 99, 101, 102, 103, 106, 107, 109, 110, 114, 117, 123, 126, 127, 129, 131, 132, 133, 136, 144, 146, 147, 150, 155, 156, 164, 166, 177, 179, 181, 182, 188, 190, 192, 194, 201, 202, 204, 209, 214, 217, 220, 221, 225, 231, 232, 234, 235, 236, 240, 244, 246, 248, 253, 254, 257, 259, 260, 261, 262, 263, 264, 265, 266, 268, 271, 275, 277, 279, 280, 281, 285, 286, 287, 288, 297, 302, 309, ...], 0: [3, 6, 8, 25, 27, 33, 38, 57, 62, 63, 67, 69, 70, 72, 74, 83, 89, 91, 92, 98, 111, 112, 122, 124, 135, 158, 175, 187, 197, 198, 199, 200, 205, 206, 207, 215, 216, 242, 243, 258, 267, 272, 283, 299, 300, 303, 305, 306, 307, 310, 311, 312, 313, 314, 315, 316, 319, 326, 329, 348, 353, 355, 376, 377, 378, 380, 385, 386, 387, 389, 399, 402, 406, 418, 424, 425, 426, 427, 431, 432, 433, 434, 435, 447, 486, 487, 503, 511, 512, 514, 515, 524, 525, 535, 536, 539, 547, 549, 550, 554, ...], 1: [0, 5, 21, 44, 46, 48, 51, 82, 115, 118, 274, 293, 330, 331, 332, 361, 401, 413, 507, 520, 522, 523, 558, 560, 643, 650, 681, 700, 734, 747, 753, 782, 784, 836, 839, 893, 905, 934, 951, 976, 999, 1037, 1048, 1052, 1053, 1082, 1109, 1113, 1115, 1121, 1139, 1146, 1219, 1221, 1264, 1355, 1382, 1392, 1432, 1467, 1485, 1490, 1497, 1513, 1526, 1565, 1682, 1728, 1737, 1738, 1806, 1815, 1824, 1828, 1844, 1845, 1885, 1959, 2014, 2017, 2029, 2052, 2072, 2153, 2157, 2168, 2193, 2199, 2214, 2228, 2232, 2240, 2243, 2264, 2300, 2317, 2353, 2376, 2402, 2405, ...], 2: [15, 39, 60, 61, 149, 157, 222, 250, 289, 320, 448, 538, 630, 658, 662, 665, 709, 759, 810, 837, 897, 901, 917, 924, 925, 945, 946, 954, 959, 1049, 1050, 1090, 1131, 1140, 1154, 1172, 1251, 1300, 1313, 1328, 1387, 1393, 1431, 1440, 1448, 1475, 1507, 1535, 1591, 1597, 1603, 1615, 1636, 1705, 1725, 1736, 1771, 1777, 1791, 1796, 1855, 1867, 1903, 1918, 1928, 1930, 1942, 1943, 1989, 2021, 2039, 2095, 2119, 2169, 2195, 2309, 2337, 2418, 2426, 2429, 2522, 2582, 
2598, 2678, 2679, 2682], 3: [50, 113, 160, 213, 224, 229, 238, 239, 352, 400, 409, 506, 545, 570, 701, 703, 712, 716, 830, 838, 858, 921, 1008, 1078, 1124, 1130, 1194, 1214, 1305, 1308, 1311, 1360, 1421, 1441, 1473, 1476, 1532, 1533, 1548, 1580, 1616, 1622, 1649, 1679, 1735, 1883, 1897, 1920, 1985, 2015, 2084, 2091, 2097, 2118, 2152, 2181, 2212, 2223, 2237, 2249, 2310, 2313, 2347, 2369, 2381, 2390, 2470, 2496, 2511, 2514, 2529, 2549, 2569, 2601, 2626, 2666, 2688],
Is it possible to put this into a DataFrame with two columns? For example:
Number
Value
-1
[2, 10, 11, 13, 16, 19, 24, 28, 30, 32, 34, 35, 36, 40, 42, 49, 54, 56, 59, 64, 66, 78, 94, 99, 101, 102, 103, 106, 107, 109, 110, 114, 117, 123, 126, 127, 129, 131, 132, 133, 136, 144, 146, 147, 150, 155, 156, 164, 166, 177, 179, 181, 182, 188, 190, 192, 194, 201, 202, 204, 209, 214, 217, 220, 221, 225, 231, 232, 234, 235, 236, 240, 244, 246, 248, 253, 254, 257, 259, 260, 261, 262, 263, 264, 265, 266, 268, 271, 275, 277, 279, 280, 281, 285, 286, 287, 288, 297, 302, 309, ...]
0
[3, 6, 8, 25, 27, 33, 38, 57, 62, 63, 67, 69, 70, 72, 74, 83, 89, 91, 92, 98, 111, 112, 122, 124, 135, 158, 175, 187, 197, 198, 199, 200, 205, 206, 207, 215, 216, 242, 243, 258, 267, 272, 283, 299, 300, 303, 305, 306, 307, 310, 311, 312, 313, 314, 315, 316, 319, 326, 329, 348, 353, 355, 376, 377, 378, 380, 385, 386, 387, 389, 399, 402, 406, 418, 424, 425, 426, 427, 431, 432, 433, 434, 435, 447, 486, 487, 503, 511, 512, 514, 515, 524, 525, 535, 536, 539, 547, 549, 550, 554, ...],
Try:
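import pandas as pd

# Shortened version of the dictionary from the question
dct = {
    -1: [2, 10, 11],
    0: [3, 6, 27, 33],
    1: [0, 5, 21],
    2: [15],
    3: [50, 113, 160, 213, 224],
}
# One row per key: the key goes in "Number", its list in "Value"
df = pd.DataFrame({"Number": list(dct.keys()), "Value": list(dct.values())})
print(df)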
Prints:
   Number                     Value
0      -1               [2, 10, 11]
1       0            [3, 6, 27, 33]
2       1                [0, 5, 21]
3       2                      [15]
4       3  [50, 113, 160, 213, 224]
Another option is to use the keys as the index:
import pandas as pd

# d is the dictionary from the question
df = pd.DataFrame()
df["Value"] = list(d.values())
df.index = list(d.keys())

# OR
df = pd.DataFrame.from_dict({k: [v] for k, v in d.items()},
                            orient="index",
                            columns=["Value"])
print(df)
#                                                 Value
# -1  [2, 10, 11, 13, 16, 19, 24, 28, 30, 32, 34, 35...
#  0  [3, 6, 8, 25, 27, 33, 38, 57, 62, 63, 67, 69, ...
#  1  [0, 5, 21, 44, 46, 48, 51, 82, 115, 118, 274, ...
#  2  [15, 39, 60, 61, 149, 157, 222, 250, 289, 320,...
#  3  [50, 113, 160, 213, 224, 229, 238, 239, 352, 4...
Code:
import pandas as pd
from sqlalchemy import create_engine

# inverted_index maps each word to a set of document ids
df = pd.DataFrame(list(inverted_index.items()), columns=['words', 'docids'])

# The URL format is user:password@host/database
engine = create_engine("mysql+pymysql://{user}:{pw}@localhost/{db}"
                       .format(user="root",
                               pw="shreshre",
                               db="nltk"))
df.to_sql(con=engine, name='documents', if_exists='replace')
Here I want to convert my inverted index, which is a dictionary, into a DataFrame and write it to MySQL. But I am receiving an error.
Output:
OperationalError: (pymysql.err.OperationalError) (1241, 'Operand should contain 1 column(s)')
[SQL: INSERT INTO documents (`index`, words, docids) VALUES (%(index)s, %(words)s, %(docids)s)]
[parameters: ({'index': 0, 'words': 'bank', 'docids': {0, 1, 2, 3, 4, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 37, 38, 39, 40, 41, 43, 44, 45, 46, 48, 52, 53, 54, 55, 56, 59, 60, 62, 64, 66, 67, 68, 69 ... (1314 characters truncated) ... 719, 720, 721, 722, 724, 726, 728, 733, 734, 735, 736, 737, 739, 740, 743, 746, 748, 752, 753, 755, 756, 757, 758, 759, 762, 765, 766, 767, 768, 772}}, {'index': 1, 'words': 'defin', 'docids': {0, 2, 354, 612, 773, 76, 84}}, {'index': 2, 'words': 'establish', 'docids': {0, 161, 391, 328, 330, 718, 719, 720, 722, 245, 217, 411, 156}}, {'index': 3, 'words': 'custodi', 'docids': {0, 405}}, {'index': 4, 'words': ',', 'docids': {0, 1, 2, 3, 8, 14, 17, 20, 22, 24, 25, 26, 27, 30, 31, 41, 43, 45, 48, 49, 51, 52, 54, 55, 59, 62, 63, 65, 67, 69, 70, 72, 73, 74, 76, 78, 79, 80, 81 ... (1428 characters truncated) ... 705, 706, 708, 710, 711, 712, 716, 718, 719, 721, 722, 724, 729, 730, 732, 735, 736, 741, 743, 745, 749, 756, 757, 758, 762, 766, 768, 769, 771, 773}}, {'index': 5, 'words': 'loan', 'docids': {0, 512, 517, 519, 538, 29, 33, 34, 557, 558, 47, 559, 564, 574, 578, 580, 70, 73, 76, 79, 616, 621, 113, 114, 115, 116, 117, 123, 124, 127, 128, 129, ... (75 characters truncated) ... 711, 200, 219, 227, 228, 234, 235, 241, 758, 771, 309, 310, 340, 343, 346, 349, 354, 365, 368, 380, 383, 384, 385, 386, 440, 447, 448, 451, 453, 474}}, {'index': 6, 'words': 'exchang', 'docids': {0, 416, 290, 354, 357, 425, 10, 302, 430, 405, 376, 415}}, {'index': 7, 'words': 'issu', 'docids': {0, 386, 419, 676, 390, 397, 302, 272, 274, 306, 722, 700, 350}} ... displaying 10 of 1969 total bound parameter sets ... {'index': 1967, 'words': '86', 'docids': {774}}, {'index': 1968, 'words': 'separ', 'docids': {774}})]
I am not able to understand the existing solutions posted on Stack Overflow for similar errors. Could someone please help me out?
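A possible explanation, judging from the bound parameters in the traceback: each docids value is a Python set, and the driver appears to expand a set into a parenthesised list of values, so MySQL sees several operands where it expects one. Below is a minimal sketch of one workaround, assuming it is acceptable to store the ids as a comma-separated string. The small inverted_index here is made up for illustration.

```python
import pandas as pd

# Hypothetical miniature inverted index: word -> set of document ids
inverted_index = {"bank": {0, 1, 2, 17}, "defin": {0, 2, 354}, "custodi": {0, 405}}

df = pd.DataFrame(list(inverted_index.items()), columns=["words", "docids"])

# Flatten each set into one scalar string so every cell maps to a single SQL value
df["docids"] = df["docids"].apply(lambda ids: ",".join(str(i) for i in sorted(ids)))

print(df)
# The to_sql call from the question can then be used unchanged:
# df.to_sql(con=engine, name="documents", if_exists="replace")
```

If the ids need to be queried individually, a separate table with one row per (word, docid) pair may be a better fit than a packed string.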
This is my first time using scikit-learn, and I am struggling to get a good line of best fit for the following data:
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
y = [0, 187, 262, 296, 319, 340, 359, 376, 388, 401, 411, 414, 423, 430, 433, 439, 446, 452, 457, 461, 465, 469, 470, 470, 472, 474, 479, 484, 486, 487, 489, 489, 491, 491, 491, 494, 494, 498, 500, 500, 500, 500, 505, 506, 506, 506, 506, 507, 508, 509, 509, 509, 511, 511, 512, 514, 515, 515, 515, 517, 517, 518, 518, 518, 518, 518, 518, 518, 518, 518, 518, 518, 518, 518, 518, 518, 518, 518, 518, 518, 518, 518, 518, 518, 518, 519, 519, 519, 519, 519, 519, 519, 519, 519, 519, 519, 519, 519, 519, 519, 519]
I was able to produce a graph with matplotlib as a result of the code below:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy.optimize import curve_fit

fig, ax = plt.subplots(figsize=(10, 8))
ax1 = plt.scatter(x, y, c='brown')

def func(x, a, b, c):
    # Logarithmic model: a * log2(b + x) + c
    return a * np.log2(b + x) + c

# frequency is presumably a DataFrame holding the x and y data shown above
popt, pcov = curve_fit(func, frequency['pct'], frequency['Facility Count Military'])
print(popt)
# popt was the following: [4.28209689e+01 1.46600585e-02 2.59467635e+02]
ax2 = sns.lineplot(x=frequency['pct'], y=func(frequency['pct'], *popt), color='black')
plt.xlabel('x')
plt.ylabel('y')
plt.ylim([0, 530])
plt.xlim([0, 100])
plt.title('y over x', y=1, fontsize=15, fontweight='semibold')
plt.show()
(a) Is my methodology correct?
(b) Does it make sense to fit the data with a log base 2 curve, or should it be something different?
(c) Is there a way to translate "popt" into the line graph that will eventually be used?
Edit: Never mind about part (c). I edited the code accordingly and figured that out on my own.
Any assistance on this is truly appreciated.
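On question (b), it may help to note that the base of the logarithm only rescales the coefficient a, since log2(u) = ln(u) / ln(2). The sketch below checks this numerically using the popt values printed in the question. The name func_ln is introduced here for illustration only.

```python
import numpy as np

def func(x, a, b, c):
    # Model from the question: base-2 logarithm
    return a * np.log2(b + x) + c

def func_ln(x, a, b, c):
    # Same model shape written with the natural logarithm
    return a * np.log(b + x) + c

# Fitted parameters reported in the question
a, b, c = 4.28209689e+01, 1.46600585e-02, 2.59467635e+02

x = np.arange(0, 101)
y_log2 = func(x, a, b, c)
# Rescaling a by 1/ln(2) reproduces the base-2 curve
y_ln = func_ln(x, a / np.log(2), b, c)

print(np.allclose(y_log2, y_ln))
```

So fitting with base 2, base e, or base 10 describes the same family of curves. Whether a logarithm is the right family for data that levels off is a separate question.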
I have an array all with dimensions (19494500, 376). I need to arrange these 376 columns in a particular sequence l that I have generated:
l
array([ 0, 94, 188, 282, 1, 95, 189, 283, 2, 96, 190, 284, 3,
97, 191, 285, 4, 98, 192, 286, 5, 99, 193, 287, 6, 100,
194, 288, 7, 101, 195, 289, 8, 102, 196, 290, 9, 103, 197,
291, 10, 104, 198, 292, 11, 105, 199, 293, 12, 106, 200, 294,
13, 107, 201, 295, 14, 108, 202, 296, 15, 109, 203, 297, 16,
110, 204, 298, 17, 111, 205, 299, 18, 112, 206, 300, 19, 113,
207, 301, 20, 114, 208, 302, 21, 115, 209, 303, 22, 116, 210,
304, 23, 117, 211, 305, 24, 118, 212, 306, 25, 119, 213, 307,
26, 120, 214, 308, 27, 121, 215, 309, 28, 122, 216, 310, 29,
123, 217, 311, 30, 124, 218, 312, 31, 125, 219, 313, 32, 126,
220, 314, 33, 127, 221, 315, 34, 128, 222, 316, 35, 129, 223,
317, 36, 130, 224, 318, 37, 131, 225, 319, 38, 132, 226, 320,
39, 133, 227, 321, 40, 134, 228, 322, 41, 135, 229, 323, 42,
136, 230, 324, 43, 137, 231, 325, 44, 138, 232, 326, 45, 139,
233, 327, 46, 140, 234, 328, 47, 141, 235, 329, 48, 142, 236,
330, 49, 143, 237, 331, 50, 144, 238, 332, 51, 145, 239, 333,
52, 146, 240, 334, 53, 147, 241, 335, 54, 148, 242, 336, 55,
149, 243, 337, 56, 150, 244, 338, 57, 151, 245, 339, 58, 152,
246, 340, 59, 153, 247, 341, 60, 154, 248, 342, 61, 155, 249,
343, 62, 156, 250, 344, 63, 157, 251, 345, 64, 158, 252, 346,
65, 159, 253, 347, 66, 160, 254, 348, 67, 161, 255, 349, 68,
162, 256, 350, 69, 163, 257, 351, 70, 164, 258, 352, 71, 165,
259, 353, 72, 166, 260, 354, 73, 167, 261, 355, 74, 168, 262,
356, 75, 169, 263, 357, 76, 170, 264, 358, 77, 171, 265, 359,
78, 172, 266, 360, 79, 173, 267, 361, 80, 174, 268, 362, 81,
175, 269, 363, 82, 176, 270, 364, 83, 177, 271, 365, 84, 178,
272, 366, 85, 179, 273, 367, 86, 180, 274, 368, 87, 181, 275,
369, 88, 182, 276, 370, 89, 183, 277, 371, 90, 184, 278, 372,
91, 185, 279, 373, 92, 186, 280, 374, 93, 187, 281, 375])
So I am doing the following:
    all_c = all[:, l]
but I am getting a "memory error".
Can you suggest the most memory-efficient way to do this?
Rather than permuting the whole array at once, you can do it row by row, in place. Try:
    for r in range(all.shape[0]):
        # Only one row is copied at a time before being written back
        all[r] = all[r, l]
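A sketch of the same idea applied to blocks of rows rather than one row at a time, which keeps the temporary copy small while avoiding a Python-level loop over every row. The function name permute_columns_inplace and the chunk_rows default are assumptions made for this example, and the small array stands in for the real one.

```python
import numpy as np

def permute_columns_inplace(arr, order, chunk_rows=100_000):
    """Reorder the columns of arr in place, one block of rows at a time."""
    for start in range(0, arr.shape[0], chunk_rows):
        block = arr[start:start + chunk_rows]   # basic slice: a view, no copy
        block[:] = block[:, order]              # temporary copy is only chunk-sized
    return arr

# Small stand-in for the (19494500, 376) array: 10 rows, 8 columns
data = np.arange(80).reshape(10, 8)
# Same interleaving pattern as l in the question, here for 2 groups of 4 columns
order = np.arange(8).reshape(2, 4).T.ravel()    # [0, 4, 1, 5, 2, 6, 3, 7]

expected = data[:, order]                       # full copy, fine at this size
permute_columns_inplace(data, order, chunk_rows=3)
print(np.array_equal(data, expected))
```

The chunk size trades peak memory against loop overhead: a block of 100,000 rows by 376 columns of 8-byte values needs roughly 300 MB for its temporary copy.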
My two data sets are:
fnamerp1=([ 93, 87, 96, 93, 90, 123, 111, 82, 87, 115, 103,
101, 93, 92, 111, 107, 114, 106, 116, 106, 128, 115,
141, 134, 120, 149, 140, 166, 152, 171, 192, 207, 227,
266, 270, 286, 355, 385, 397, 488, 462, 531, 579, 622,
711, 720, 801, 858, 906, 915, 915, 956, 1004, 1012, 1045,
1076, 1063, 1013, 985, 924, 959, 838, 766, 763, 742, 642,
587, 557, 484, 393, 353, 341, 284, 240, 221, 209, 147,
109, 113, 102, 71, 63, 63, 50, 29, 39, 36, 25,
30, 23, 27, 23, 19, 19, 24, 15, 23, 21, 26,
15])
fnamerp2=([ 105, 89, 120, 121, 103, 105, 113, 94, 104, 115, 122, 116, 121,
129, 118, 126, 138, 146, 161, 163, 178, 192, 194, 222, 268, 272,
285, 342, 380, 378, 373, 448, 493, 511, 571, 603, 691, 772, 738,
796, 839, 832, 883, 930, 963, 975, 972, 931, 947, 941, 934, 964,
871, 869, 826, 793, 733, 708, 606, 610, 515, 483, 409, 352, 358,
264, 266, 205, 191, 167, 136, 138, 99, 102, 82, 57, 65, 53,
51, 32, 26, 27, 39, 21, 29, 23, 25, 24, 16, 17, 27,
33, 19, 13, 24, 26, 18, 22, 18, 20])
I want to find the lag between the centers of the two peaks (not just their maxima). My plan is to use np.argmax(signal.correlate(fnamerp1, fnamerp2)).
What is the right way to do this, both mathematically and elegantly in Python?
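A minimal sketch of the cross-correlation approach on synthetic data, assuming two Gaussian-shaped peaks with a known offset (the arrays below are made up, not the data from the question). It uses np.correlate rather than scipy.signal.correlate so that it only needs NumPy, and it converts the argmax index into a signed lag, which the bare np.argmax call does not do. A centroid difference is shown alongside as another way to compare peak centers.

```python
import numpy as np

n = np.arange(100)
# Synthetic peaks: same width, centers 7 samples apart (illustrative only)
sig1 = 1000 * np.exp(-0.5 * ((n - 55) / 8.0) ** 2)
sig2 = 1000 * np.exp(-0.5 * ((n - 48) / 8.0) ** 2)

# Full cross-correlation; output index i corresponds to lag i - (len(sig2) - 1)
xcorr = np.correlate(sig1, sig2, mode="full")
lags = np.arange(-(len(sig2) - 1), len(sig1))
# Positive lag: the peak in sig1 sits that many samples later than in sig2
lag = lags[np.argmax(xcorr)]

# Alternative: difference between the intensity-weighted centers (centroids)
centroid_shift = np.sum(n * sig1) / np.sum(sig1) - np.sum(n * sig2) / np.sum(sig2)

print(lag, round(centroid_shift, 3))
```

With real data, a nonzero baseline can pull the correlation peak toward zero lag, so subtracting a baseline estimate from each signal first may be worth considering.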