Select values from two different datasets in Python - python

I am having trouble working with my two datasets. Here is the problem:
I have 2 different datasets:
training_df = pd.read_csv('.../train.csv')
test_df = pd.read_csv('.../test.csv')
I have to take the values of some columns from train.csv and of other columns from test.csv. I tried this:
num_attrib = pd.DataFrame(training_df, columns=[0, 2, 3, 15, 16, 17, 18, 24, 32, 34, 35, 36, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 52, 54, 57, 59, 60, 64, 65, 66, 67, 68, 69, 70, 71, 72])
cat_attrib = pd.DataFrame(training_df, columns=[1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 33, 37, 38, 39, 40, 51, 53, 55, 56, 58, 61, 62, 63, 73, 74])
num_attrib_test = pd.DataFrame(test_df, columns=[0, 2, 3, 15, 16, 17, 18, 24, 32, 34, 35, 36, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 52, 54, 57, 59, 60, 64, 65, 66, 67, 68, 69, 70, 71, 72])
cat_attrib_test = pd.DataFrame(test_df, columns=[1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 33, 37, 38, 39, 40, 51, 53, 55, 56, 58, 61, 62, 63, 73, 74])
Both datasets contain numerical and categorical data. I have to separate the categorical columns from the numerical ones in each dataset, but my approach does not work.
I need this because I have to apply ColumnTransformer() to training_df and test_df.
Any suggestions?
Thank you so much.

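You are looking for iloc, which selects columns by integer position (pd.DataFrame(df, columns=[...]) treats the values as column labels instead). See the pandas documentation for DataFrame.iloc.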
num_attrib = training_df.iloc[:,[0,2,3,...,15]]
You can also slice:
# columns at even positions (0, 2, 4, ...)
num_attrib = training_df.iloc[:, ::2]
# columns at odd positions (1, 3, 5, ...)
num_attrib = training_df.iloc[:, 1::2]
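A minimal sketch of the same idea applied to both frames, assuming a tiny made-up table in place of train.csv and test.csv (the column names and position lists below are hypothetical). It also shows select_dtypes as a possible alternative when the split is simply numeric versus non-numeric:

```python
import pandas as pd

# Hypothetical stand-ins for training_df and test_df
training_df = pd.DataFrame({
    "age": [25, 32],
    "city": ["Rome", "Oslo"],
    "income": [1000.0, 2000.0],
    "sex": ["f", "m"],
})
test_df = training_df.copy()

# Hypothetical position lists; the real ones come from the question
num_positions = [0, 2]
cat_positions = [1, 3]

# Select by integer position, reusing the same lists for both frames
num_attrib = training_df.iloc[:, num_positions]
cat_attrib = training_df.iloc[:, cat_positions]
num_attrib_test = test_df.iloc[:, num_positions]
cat_attrib_test = test_df.iloc[:, cat_positions]

# Alternative: let pandas pick the columns by dtype
num_by_dtype = training_df.select_dtypes(include="number")
cat_by_dtype = training_df.select_dtypes(exclude="number")

print(list(num_attrib.columns))
print(list(cat_by_dtype.columns))
```

Keeping the position lists in variables means the training and test frames are always split the same way.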

Related

I have a problem using the Counter class in Python [closed]

I have a problem using the Counter class from Python's collections module on a list of numbers.
Developers, please help me.
from collections import Counter
serie = [5, 6, 7, 8, 10, 12, 13, 25, 27, 29, 33, 37, 39, 41, 47, 56, 59, 66, 76, 78, 1, 7, 15, 16, 21, 25, 26, 28, 30, 38, 41, 48, 51, 59, 60, 65, 68, 70, 75, 79, 3, 6, 14, 15, 17, 23,
25, 27, 33, 34, 35, 38, 46, 51, 53, 58, 63, 68, 74, 77, 7, 9, 11, 21, 26, 27, 32, 35, 38, 43, 44, 52, 53, 56, 59, 65, 66, 74, 76, 80, 3, 9, 19, 27, 28, 34, 35, 39, 47, 49, 50, 51, 53, 57, 61, 66, 67, 72, 74, 80, 2, 3, 24, 25, 28, 30, 35, 36, 51, 54, 55, 57, 61, 67, 68, 69, 70, 71, 74, 79, 3, 11, 14, 16, 19, 25, 27, 33, 35, 38, 44, 46, 48, 58, 63, 64, 65, 68, 69, 73, 7, 12, 18, 23, 24, 25, 27, 28, 47, 52, 53, 59, 65, 66, 67, 68, 69, 70, 72, 75, 1, 2, 5, 8, 9, 10, 13, 20, 25, 28, 29, 33, 39, 41, 43, 48, 49, 53, 66, 74, 1, 6, 7, 9, 15, 18, 19, 23, 25, 26, 33, 34, 42, 45, 46, 62, 65, 71, 79, 80, 2, 4, 6, 7, 11, 12, 15,
21, 23, 24, 26, 33, 34, 38, 51, 53, 67, 68, 73, 79, 1, 8, 9, 19, 20, 24, 30, 32, 35, 40,
42, 44, 47, 54, 55, 56, 60, 61, 78, 80]
# Count the number of occurrences of each element in the series
occurrences = Counter(serie)
# Sort the elements by decreasing number of occurrences
sorted_occurrences = occurrences.most_common()
# Retrieve the most frequent elements
most_common_count = sorted_occurrences[0][1]
most_common = [x[0] for x in sorted_occurrences if x[1] == most_common_count][:5]
print(most_common)
I want this code to return the five most frequent numbers while it returns
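You are almost there. most_common() already returns the elements sorted by decreasing count, so take the first five instead of keeping only those tied with the top count: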
from collections import Counter
serie = [5, 6, 7, 8, 10, 12, 13, 25, 27, 29, 33, 37, 39, 41, 47, 56, 59, 66, 76, 78, 1, 7, 15, 16, 21, 25, 26, 28, 30, 38, 41, 48, 51, 59, 60, 65, 68, 70, 75, 79, 3, 6, 14, 15, 17, 23,
25, 27, 33, 34, 35, 38, 46, 51, 53, 58, 63, 68, 74, 77, 7, 9, 11, 21, 26, 27, 32, 35, 38, 43, 44, 52, 53, 56, 59, 65, 66, 74, 76, 80, 3, 9, 19, 27, 28, 34, 35, 39, 47, 49, 50, 51, 53, 57, 61, 66, 67, 72, 74, 80, 2, 3, 24, 25, 28, 30, 35, 36, 51, 54, 55, 57, 61, 67, 68, 69, 70, 71, 74, 79, 3, 11, 14, 16, 19, 25, 27, 33, 35, 38, 44, 46, 48, 58, 63, 64, 65, 68, 69, 73, 7, 12, 18, 23, 24, 25, 27, 28, 47, 52, 53, 59, 65, 66, 67, 68, 69, 70, 72, 75, 1, 2, 5, 8, 9, 10, 13, 20, 25, 28, 29, 33, 39, 41, 43, 48, 49, 53, 66, 74, 1, 6, 7, 9, 15, 18, 19, 23, 25, 26, 33, 34, 42, 45, 46, 62, 65, 71, 79, 80, 2, 4, 6, 7, 11, 12, 15,
21, 23, 24, 26, 33, 34, 38, 51, 53, 67, 68, 73, 79, 1, 8, 9, 19, 20, 24, 30, 32, 35, 40,
42, 44, 47, 54, 55, 56, 60, 61, 78, 80]
# Count the number of occurrences of each element in the series
occurrences = Counter(serie)
# Sort the elements by decreasing number of occurrences
sorted_occurrences = occurrences.most_common()
print([x[0] for x in sorted_occurrences][:5])
#output
[25, 7, 27, 33, 68]
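As a shorter variant, most_common accepts the number of entries to return, so the slice is not needed. A minimal sketch on a small made-up list (the helper name top_n and the sample data are illustrative, not from the question):

```python
from collections import Counter


def top_n(values, n=5):
    """Return the n most frequent values, most frequent first."""
    # most_common(n) yields (value, count) pairs sorted by decreasing count
    return [value for value, _count in Counter(values).most_common(n)]


sample = [3, 1, 3, 2, 3, 1, 4]
print(top_n(sample, 2))  # [3, 1]
```

Values with equal counts come back in the order they were first seen in the input.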

How do you split an array at specific indices in NumPy for Python?

The question is as follows:
x = np.arange(100)
Write Python code to split the following array at these intervals: 10, 25, 45, 75, 95
I have used the split function but cannot get the array split at these specific intervals. Can anyone suggest another method, or am I doing it wrong?
Here are both the manual way and the NumPy way with np.split.
import numpy as np
# Manual method
x = np.arange(100)
split_indices = [10, 25, 45, 75, 95]
split_arrays = []
# Pair each start index with the next one; len(x) closes the final piece
for i, j in zip([0] + split_indices, split_indices + [len(x)]):
    split_arrays.append(x[i:j])
print(split_arrays)
# NumPy method
split_arrays_np = np.split(x, split_indices)
print(split_arrays_np)
And the result is (for both):
[array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]),
array([25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44]),
array([45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74]),
array([75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94]),
array([95, 96, 97, 98, 99])
]
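To see how np.split reads its index list, here is a minimal sketch on a shorter array (the array and indices are made up for illustration). Each index marks where a new piece starts, so k indices produce k + 1 pieces:

```python
import numpy as np

x = np.arange(10)

# Two split indices -> three pieces: x[:3], x[3:7], x[7:]
pieces = np.split(x, [3, 7])

print([piece.tolist() for piece in pieces])
# [[0, 1, 2], [3, 4, 5, 6], [7, 8, 9]]
```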

Transform a list of ranges into a single list

I have a data frame that holds some points used to mark another dataset.
I'm creating a range from each start mark to the corresponding stop mark, and I want to combine these ranges into a single list or NumPy array.
I have the following:
list(map(lambda limits: np.arange(limits[1] - limits[0] - 1, -1, -1),
         zip(df_cycles['Start_point'], df_cycles['Stop_point'])))
This is returning a list of arrays:
[array([1155, 1154, 1153, ..., 2, 1, 0]),
array([71, 70, 69, 68, 67, 66, 65, 64, 63, 62, 61, 60, 59, 58, 57, 56, 55,
54, 53, 52, 51, 50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38,
37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21,
20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4,
3, 2, 1, 0]),
...]
How can I modify or transform the output to have a single list or NumPy array like this:
array([1155, 1154, 1153, ..., 2, 1, 0, 71, 70, 69, 68, 67, 66, 65,
64, 63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49, 48,
47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32, 31,
30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14,
13, 12, 11, 10, 9, 8, 7, 6, 5, 4,3, 2, 1, 0,...])
Just do:
flatarray = np.concatenate(list_of_arrays)
concatenate puts together two or more arrays into a single new array. You don't want to call it one array at a time (that creates a Schlemiel the Painter's algorithm), but once you've got them all, it's an efficient way to combine them.
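A minimal end-to-end sketch, assuming two plain lists as stand-ins for df_cycles['Start_point'] and df_cycles['Stop_point'] (the values are made up):

```python
import numpy as np

# Hypothetical stand-ins for the Start_point and Stop_point columns
starts = [0, 10]
stops = [4, 13]

# Build every countdown array first...
countdowns = [
    np.arange(stop - start - 1, -1, -1)
    for start, stop in zip(starts, stops)
]

# ...then join them with a single concatenate call
flat = np.concatenate(countdowns)

print(flat.tolist())  # [3, 2, 1, 0, 2, 1, 0]
```

Collecting the pieces in a list and concatenating once avoids copying the growing result on every iteration.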

What exactly does adding more bins into `np.histogram` do?

What exactly does adding more bins into np.histogram(data, bins=100) do? I know that it divides the data into the number of bins you specify, but what exactly does that entail? For example, I have a histogram and I fitted a best-fit line to it using scipy.optimize.curve_fit, and when I increased the number of bins, the accuracy of my best-fit line also increased.
The following function illustrates the difference using matplotlib. The same data is plotted using 5 bins and 10 bins:
import matplotlib.pyplot as plt

def plot_histogram(num_bins):
    x = [1, 1, 2, 3, 3, 5, 7, 8, 9, 10,
         10, 11, 11, 13, 13, 15, 16, 17, 18, 18,
         18, 19, 20, 21, 21, 23, 24, 24, 25, 25,
         25, 25, 26, 26, 26, 27, 27, 27, 27, 27,
         29, 30, 30, 31, 33, 34, 34, 34, 35, 36,
         36, 37, 37, 38, 38, 39, 40, 41, 41, 42,
         43, 44, 45, 45, 46, 47, 48, 48, 49, 50,
         51, 52, 53, 54, 55, 55, 56, 57, 58, 60,
         61, 63, 64, 65, 66, 68, 70, 71, 72, 74,
         75, 77, 81, 83, 84, 87, 89, 90, 90, 91
         ]
    plt.hist(x, bins=num_bins)
    plt.title(f'{num_bins} bins')
    plt.show()

plot_histogram(5)
plot_histogram(10)
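With 5 bins, each bar is 18 units wide, and the bar covering values from 19 up to (but not including) 37 holds 30 data points.
With 10 bins, you have more detail. Each bar is 9 units wide, so that same range is split in two: 19 data points from 19 to 28 and 11 data points from 28 to 37.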
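The same effect can be checked numerically with np.histogram, which returns the counts and the bin edges instead of drawing them. A minimal sketch on a small made-up sample (not the data plotted above):

```python
import numpy as np

data = [0, 1, 1, 2, 4, 5, 7, 8, 8, 10]

# Two wide bins: edges at 0, 5, 10
coarse_counts, coarse_edges = np.histogram(data, bins=2)

# Five narrower bins over the same range: edges at 0, 2, 4, 6, 8, 10
fine_counts, fine_edges = np.histogram(data, bins=5)

print(coarse_counts.tolist(), coarse_edges.tolist())
# [5, 5] [0.0, 5.0, 10.0]
print(fine_counts.tolist(), fine_edges.tolist())
# [3, 1, 2, 1, 3] [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
```

The total count stays the same; more bins only spread it over narrower intervals, which gives a curve fit more points to work with, each based on fewer samples.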

Reading formatted array from file in Python

I have a file that contains some strings followed by two formatted arrays. It looks something like this:
megabuck
Hello world
[58, 50, 42, 34, 26, 18, 10, 2,
61, 53, 45, 37, 29, 21, 13, 5,
63, 55, 47, 39, 31, 23, 15, 7]
[57, 49, 41, 33, 25, 17, 9,
1, 58, 50, 42, 34, 26, 18,
14, 6, 61, 53, 45, 37, 29,
21, 13, 5, 28, 20, 12, 4]
I don't know the size of the arrays beforehand. The only thing I know is that each array is delimited by [ and ]. What would be an elegant way to read the arrays?
I am a newbie in Python.
Use a regular expression with re.findall.
Example:
import re
import ast
with open(filename) as infile:
    data = infile.read()
# re.S lets "." match newlines, so an array may span several lines
for i in re.findall(r"(\[.*?\])", data, flags=re.S):
    print(ast.literal_eval(i))
Output:
[58, 50, 42, 34, 26, 18, 10, 2, 61, 53, 45, 37, 29, 21, 13, 5, 63, 55, 47, 39, 31, 23, 15, 7]
[57, 49, 41, 33, 25, 17, 9, 1, 58, 50, 42, 34, 26, 18, 14, 6, 61, 53, 45, 37, 29, 21, 13, 5, 28, 20, 12, 4]
I wouldn't call it elegant, but it works:
ars = """
megabuck
Hello world
[58, 50, 42, 34, 26, 18, 10, 2,
61, 53, 45, 37, 29, 21, 13, 5,
63, 55, 47, 39, 31, 23, 15, 7]
[57, 49, 41, 33, 25, 17, 9,
1, 58, 50, 42, 34, 26, 18,
14, 6, 61, 53, 45, 37, 29,
21, 13, 5, 28, 20, 12, 4]
"""
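arrays = []
for a in ars.split("["):
    # Only chunks that contain a closing bracket hold an array
    if ']' in a:
        # int() ignores surrounding whitespace and newlines
        arrays.append([int(i) for i in a.replace("]", '').split(',')])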
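For reference, a self-contained sketch of the regex approach that works on a string instead of a file, so it can be run as is. The helper name read_bracketed_lists and the shortened sample text are made up for illustration:

```python
import ast
import re


def read_bracketed_lists(text):
    """Return every [...] block in text as a Python list."""
    # Non-greedy match between brackets; re.S lets it span line breaks
    blocks = re.findall(r"\[.*?\]", text, flags=re.S)
    # literal_eval safely parses each block as a list literal
    return [ast.literal_eval(block) for block in blocks]


sample = """megabuck
Hello world
[58, 50, 42,
 34, 26]
[57, 49]
"""

arrays = read_bracketed_lists(sample)
print(arrays)  # [[58, 50, 42, 34, 26], [57, 49]]
```

This assumes the arrays are not nested and that no stray brackets appear in the leading text.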
