I am new to Python and have recently learnt to create a series using pandas. I can define a series, e.g. x = pd.Series([1, 2, 3, 4, 5]), but how do I define a series for a range, say 1 to 100, rather than typing all the elements from 1 to 100?
As seen in the docs for pandas.Series, all that is required for your data parameter is an array-like, dict, or scalar value. Hence to create a series for a range, you can do exactly the same as you would to create a list for a range.
one_to_hundred = pd.Series(range(1,101))
import numpy as np
one_to_hundred = pd.Series(np.arange(1, 101, 1))
This creates the series with NumPy's arange function, which generates a range starting at 1 and going up to 100 (the stop value of 101 is excluded) in steps of 1.
There's also this:
one_to_hundred = pd.RangeIndex(1, 101).to_series()
I'm still looking for a pandas function that creates a series containing a range (sequence) of numbers directly, but I don't think it exists.
Try pd.Series([0 for i in range(20)]).
It will create a pandas Series with 20 rows, all zeros.
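As a side note (a minor variation, not part of the answer above), pd.Series also broadcasts a scalar across an index, so the comprehension isn't strictly needed:
import pandas as pd

# 20 rows, all zeros; the scalar 0 is broadcast across the index
twenty_zeros = pd.Series(0, index=range(20))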
import numpy as np
import pandas as pd

num = np.arange(1, 101)
s = pd.Series(num)
Adjust the arguments to whatever range you need. For details about np.arange, see the link below:
https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.arange.html
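For example, tweaking the start, stop, and step arguments (using the imports above) gives every fifth number instead:
fives = pd.Series(np.arange(0, 100, 5))  # 0, 5, 10, ..., 95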
I imported my CSV file and I want to use variables in my .loc code, like:
a = 0; b = 1
df.loc[int(a+2),int(2*b-2)]
but it raises an error. How do I fix it?
If you want to get the value at position (a+2, 2*b-2), use .iloc instead; .loc selects data by index label.
Try this:
df.iloc[a + 2, 2 * b - 2]
See the pandas docs for details.
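To make the difference concrete, here is a minimal sketch with a made-up DataFrame (the real CSV contents are not shown in the question, so the shape here is an assumption):
import pandas as pd

# stand-in for the imported CSV
df = pd.DataFrame([[10, 20, 30], [40, 50, 60], [70, 80, 90]],
                  columns=['x', 'y', 'z'])
a = 0
b = 1

print(df.iloc[a + 2, 2 * b - 2])  # position (2, 0) -> 70
# df.loc[2, 0] would raise a KeyError here, because 0 is not a column label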
It seems that you want to use integers to select a piece of data. In a pandas.DataFrame, you can use df.loc and df.iloc for indexing.
Allowed inputs of df.loc are:
A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as an integer position along the index).
A list or array of labels, e.g. ['a', 'b', 'c'].
A slice object with labels, e.g. 'a':'f'.
A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
An alignable boolean Series.
An alignable Index.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
Allowed inputs of df.iloc are:
An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
Note that df.loc treats integers as labels, never as positions, so df.loc[int(a+2), int(2*b-2)] only works if your index and column labels happen to be those exact integers.
So you should use df.iloc, which does accept a pair of integer positions, as in df.iloc[int(a+2), int(2*b-2)], instead of df.loc[int(a+2), int(2*b-2)].
For more details, see the documentation:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
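As a quick illustration of the inputs listed above (a toy DataFrame, assumed purely for demonstration):
import pandas as pd

df = pd.DataFrame({'col0': [1, 2, 3, 4], 'col1': [5, 6, 7, 8]})

print(df.iloc[2, 0])      # integer positions -> 3
print(df.iloc[[2, 0]])    # a list of integers selects rows 2 and 0
print(df.loc[2, 'col0'])  # .loc needs the column label -> 3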
I am trying to use a dictionary value to define the slice ranges for the iloc function, but I keep getting the error: Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array]. The Excel sheet is built for visual presentation rather than any kind of real table format (it's not mine, so I can't change it), so I have to slice the specific ranges without column labels.
Tried code (it raises the error above):
cr_dict= {'AA':'[42:43,32:65]', 'BB':'[33:34, 32:65]'}
df = my_df.iloc[cr_dict['AA']]
the results I want would be similar to
df = my_df.iloc[42:43,32:65]
I know I could change the dictionary and use the following, but it looks convoluted and is not as easy to read. Is there a better way?
Code
cr_dict = {'AA': [42, 43, 32, 65], 'BB': [33, 34, 32, 65]}
df = my_df.iloc[cr_dict['AA'][0]:cr_dict['AA'][1], cr_dict['AA'][2]:cr_dict['AA'][3]]
Define your dictionaries slightly differently.
cr_dict= {'AA':[42,43]+list(range(32,65)),
'BB':[33,34]+list(range(32,65))}
Then you can slice your DataFrame like so:
>>> my_df.iloc[cr_dict["AA"], cr_dict["BB"]].sort_index()
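Alternatively (this is not from the answer above, just another option): .iloc also accepts Python slice objects, so you can store the slices themselves in the dictionary and keep the call site close to the my_df.iloc[42:43, 32:65] you wanted:
cr_dict = {'AA': (slice(42, 43), slice(32, 65)),
           'BB': (slice(33, 34), slice(32, 65))}

rows, cols = cr_dict['AA']
df = my_df.iloc[rows, cols]  # same as my_df.iloc[42:43, 32:65]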
I have a dataframe extracted with pandas, for which one of the columns looks something like this:
What I want to do is extract the numerical values (floats) in this column, which by itself I could do. The issue is that some cells, like cell 20 in the image, contain more than one number, in which case I would like to take the average of those values. I think I would first need to recognise the separate groups of numerical values in the string (each float number) and then extract them as floats to operate with them, but I don't know how to do this.
Edit: I have found a solution using the re.findall command, based on the answer to this question: Find all floats or ints in a given string.
for index, value in z.items():  # .items() replaces the now-removed .iteritems()
    z[index] = statistics.mean([float(h) for h in re.findall(r'(?:\b\d{1,2}\b(?:\.\d*))', value)])
Note that I haven't included a match for integers, and only account for values up to 99, due to the type of data that I have.
However, I get a warning with this approach, due to the loop (there is no warning when I run it for only one element of the series):
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
Although I don't see any issue happening with my data, is this warning important?
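For what it's worth (this is general pandas behaviour, not something stated in the answers here): the warning usually means z is a view or chained slice of another DataFrame, so pandas cannot guarantee where the assignment lands. If z was taken from a DataFrame df, making the slice an explicit copy typically silences it:
z = df['some_column'].copy()  # 'some_column' is a placeholder name; .copy() makes z independent of df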
I think you can benefit from pandas' vectorized operations here. Use str.findall on the original dataframe, then apply in sequence pd.Series, to transform each list of matches into columns, and pd.to_numeric, to convert the strings to a numeric type (the default return dtype is float64). Then calculate the average of the values in each row with .mean(axis=1).
import pandas as pd
d = {0: {0: '2.469 (VLT: emission host)',
1: '1.942 (VLT: absorption)',
2: '1.1715 (VLT: absorption)',
3: '0.42 (NOT: absorption)|0.4245 (GTC)|0.4250 (ESO-VLT UT2: absorption & emission)',
4: '3.3765 (VLT: absorption)',
5: '1.86 (Xinglong: absorption)| 1.86 (GMG: absorption)|1.859 (VLT: absorption)',
6: '<2.4 (NOT: inferred)'}}
df = pd.DataFrame(d)
print(df)
s_mean = (df[0].str.findall(r'(?:\b\d{1,2}\b(?:\.\d*))')
               .apply(pd.Series)       # one column per matched number
               .apply(pd.to_numeric)   # strings -> floats
               .mean(axis=1))          # row-wise average
print(s_mean)
Output from s_mean
0 2.469000
1 1.942000
2 1.171500
3 0.423167
4 3.376500
5 1.859667
6 2.400000
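As a side note (an alternative, not part of the answer above), the same per-row mean can be computed with Series.explode, avoiding the intermediate wide DataFrame:
s_mean_alt = (df[0].str.findall(r'(?:\b\d{1,2}\b(?:\.\d*))')
                   .explode()          # one matched string per row; index values repeat
                   .astype(float)
                   .groupby(level=0)   # regroup by the original row index
                   .mean())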
I have found a solution based on what I wrote previously in the Edit of the original post.
It consists of using the re.findall() command with a regex, as posted in this question: Find all floats or ints in a given string:
statistics.mean([float(h) for h in re.findall(r'(?:\b\d{1,2}\b(?:\.\d*))', string)])
Then, to loop over the dataframe column, use the pandas apply command (df.apply). For this, I have defined a function (redshift_to_num) executing the operation above, and then applied this function to each element in the dataframe column:
import re
import pandas as pd
import statistics
def redshift_to_num(string):
    measures = [float(h) for h in re.findall(r'(?:\b\d{1,2}\b(?:\.\d*))', string)]
    mean = statistics.mean(measures)
    return mean

df.Redshift = df.Redshift.apply(redshift_to_num)
Notes:
The data of interest in my case is stored in the dataframe column df.Redshift.
In the re.findall command I haven't included a match for integers, and only account for values up to 99, due to the type of data that I have.
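One caveat worth adding (my note, not part of the solution above): statistics.mean raises StatisticsError on an empty list, so a cell with no regex match would make the apply fail. A defensive variant, assuming NaN is an acceptable result for such cells (re and statistics are already imported above):
import math

def redshift_to_num_safe(string):
    measures = [float(h) for h in re.findall(r'(?:\b\d{1,2}\b(?:\.\d*))', string)]
    return statistics.mean(measures) if measures else math.nan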
So I have a pandas DataFrame with several columns containing values I'd like to use to create new columns via a function I've defined. I'd been planning to do this using a list comprehension, as detailed in this answer. Here's what I'd been trying:
df['NewCol1'], df['NewCol2'] = [myFunction(x=row[0], y=row[1]) for row in zip(df['OldCol1'], df['OldCol2'])]
This runs correctly until it comes time to assign the values to the new columns, at which point it fails. I believe it isn't assigning the values iteratively and instead tries to assign a constant value to each column. I feel like I'm close to doing this correctly, but I can't quite figure out the assignment.
EDIT:
The data are all strings, and the function performs a fetching of some different information from another source based on those strings like so:
def myFunction(x, y):
    # read file based on value of x
    # search file for values a and b based on value of y
    return (a, b)
I know this is a little vague, but the helper function is fairly complicated to explain.
The error received is:
ValueError: too many values to unpack (expected 4)
You can use zip()
df['NewCol1'], df['NewCol2'] = zip(*[myFunction(x=row[0], y=row[1]) for row in zip(df['OldCol1'], df['OldCol2'])])
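To see why the zip(*...) is needed: the list comprehension yields one (a, b) tuple per row, and zip(*...) transposes that list into two sequences, one per new column. A minimal runnable sketch with a stand-in myFunction (the real one reads files, per the question):
import pandas as pd

def myFunction(x, y):
    # stand-in for the real file lookup, just so the example runs
    return x + '_a', y + '_b'

df = pd.DataFrame({'OldCol1': ['p', 'q'], 'OldCol2': ['r', 's']})
df['NewCol1'], df['NewCol2'] = zip(*[myFunction(x=row[0], y=row[1])
                                     for row in zip(df['OldCol1'], df['OldCol2'])])
print(df)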
I have written code where I have a dataframe containing all types of food. I then split it into fruit and veg series using str.contains. I have written code that appends any food common to both series to a list:
fruit = fruit_2.tolist()  # converting the series to a list
veg = veg_2.tolist()      # converting the series to a list
both = []

for x in range(len(fruit)):
    for y in range(len(veg)):
        if fruit[x] == veg[y]:
            both.append(fruit[x])

print(both)
This works; I'm just wondering if someone has a solution that utilises pandas and doesn't use a for loop.
Thanks
Try this:
fruit_2[fruit_2.isin(veg_2)]
This will give you the common elements.
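For example, with some made-up sample data:
import pandas as pd

fruit_2 = pd.Series(['apple', 'tomato', 'banana', 'pepper'])
veg_2 = pd.Series(['carrot', 'tomato', 'pepper'])

both = fruit_2[fruit_2.isin(veg_2)]
print(both.tolist())  # ['tomato', 'pepper']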
You could use np.intersect1d:
import numpy as np
# gives a numpy array (you can later convert to series or list)
both = np.intersect1d(fruit_2, veg_2)
courtesy of this answer
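If you need the result back as a pandas Series rather than a NumPy array, wrapping the call works:
both = pd.Series(np.intersect1d(fruit_2, veg_2))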