Getting column widths from a string in Python

I have the following string:
'    Y   M   D     PDSW      RSPC      NPPC       NEE'
Each element in the string corresponds to a column in a CSV file. Is there a way (aside from for loops) of getting the width of each column from this string? E.g. the first column has a width of 5 ('    Y'), the next has a width of 4 ('   M'), and so on.

Maybe something like:
>>> import re
>>> text = '    Y   M   D     PDSW      RSPC      NPPC       NEE'
>>> cols = re.findall(r'\s+\S+', text)
>>> [len(col) for col in cols]
[5, 4, 4, 9, 10, 10, 10]
This assumes each (right-aligned) column is one or more spaces followed by one or more non-space characters, and then takes the lengths of the resulting matches.
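Once the widths are known, they can be turned into slice boundaries for reading the data rows. The following is only a sketch: the header spacing is an assumption reconstructed from the widths reported in the answer, and `split_fixed` is a hypothetical helper name, not something from the original post.

```python
import re
from itertools import accumulate

# Assumed header: spacing chosen to match the widths [5, 4, 4, 9, 10, 10, 10].
header = "    Y   M   D     PDSW      RSPC      NPPC       NEE"

# Width of each right-aligned column: leading spaces plus the label.
widths = [len(col) for col in re.findall(r"\s+\S+", header)]

# Running totals give the end offset of each column; starts are shifted by one.
ends = list(accumulate(widths))
starts = [0] + ends[:-1]


def split_fixed(line):
    """Cut a fixed-width line at the column boundaries and trim the padding."""
    return [line[s:e].strip() for s, e in zip(starts, ends)]


names = split_fixed(header)
print(widths)
print(names)
```

The same `split_fixed` call could be applied to each data line of the file, provided the data rows use the same column layout as the header.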

Related

Python How to make a proper string slicing?

I can't figure out how to properly slice a string.
There is a line: "1, 2, 3, 4, 5, 6". The number of characters is unknown; the numbers can be either one-digit or three-digit.
I need the last value, i.e. everything after the last comma, which means getting the value 6 from this string.
You can try splitting the string and taking the last value:
string = "1, 2, 3, 4, 5, 6"
string.split(',')[-1]
# ' 6'
Add strip to get rid of the whitespace:
string.split(',')[-1].strip(' ')
# '6'
It is better to use str.rsplit with maxsplit=1, which avoids splitting more often than necessary:
string = "1, 2, 3, 4, 5, 6"
last = string.rsplit(', ', 1)[-1]
Output: '6'
It seems to me the easiest way would be to use the split method and divide your string on the comma.
In your example:
string = '1, 2, 3, 4, 5, 6'
last_value = string.split(', ')[-1]
print(last_value)
# 6
Here's a function that should do it for you:
def get_last_number(s):
    return s.split(',')[-1].strip()
Trying it on a few test strings:
s1 = "1, 2, 3, 4, 5, 6"
s2 = "123, 4, 785, 12"
s3 = "1, 2, 789654 "
...we get:
print (get_last_number(s1))
# 6
print (get_last_number(s2))
# 12
print (get_last_number(s3))
# 789654
First of all you have to split the string:
string = '1, 2, 3, 4, 5, 6'
splitted_str = string.split(',')
Then, you should get the last element:
last_elem = splitted_str[-1]
Finally, you have to delete the unnecessary white spaces:
last_number_str = last_elem.strip()
Note that this result is a string; if you need the numeric value, convert it with int:
last_elem_int = int(last_number_str)
Hope that helps
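As a variation on the answers above, str.rpartition splits only once, at the last comma, and also copes with a string that contains no comma at all. This is a minimal sketch; the function name `last_value` is made up for illustration.

```python
def last_value(text):
    """Return the text after the last comma, stripped of whitespace."""
    # rpartition returns (head, separator, tail); when there is no comma,
    # head and separator are empty and tail is the whole string.
    _, _, tail = text.rpartition(",")
    return tail.strip()


print(last_value("1, 2, 3, 4, 5, 6"))
print(last_value("123, 4, 785, 12"))
print(int(last_value("1, 2, 789654 ")))
```

Wrapping the result in int, as in the last line, gives the numeric value when one is needed.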

Can I split a DataFrame by columns?

I need to split a DataFrame by its columns.
I wrote some simple code that runs without error but doesn't give me the result I expected.
Here's the code:
dados = pd.read_excel(r'XXX')
for x in range(1,13):
    selectmonth = x
    while selectmonth < 13:
        df_datas = dados.loc[dados['month'] == selectmonth]
        correlacao2 = df_datas.corr().round(4).iloc[0]
    else: break
    print()
I did it one month at a time, entering the selected month manually, like this:
dfdatas = dados.loc[dados['month'] == selectmonth]
print('\n Voce selecionou o mês: ', selectmonth)
colunas2 = list(dfdatas.columns.values)
correlacao2 = dfdatas.corr().round(4).iloc[0]
print(correlacao2)
Is there some way to do this in a loop, from month 1 to 12?
With pandas, you should avoid loops wherever possible because they are very slow. You can achieve what you want here with index slicing. Assuming your columns are just the month numbers, you can do this.
Setting up an example df:
df = pd.DataFrame([], columns=range(15))
df:
Columns: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
Index: []
Getting the columns numbered 1 to 12:
dfdatas = df.loc[:, 1:12]
Columns: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
Index: []
In the future, you should include example data in your question.
Just try this:
correlacao2 = dados.corr(method='pearson').round(4)
for month in dados.columns:
    print('\n Voce selecionou o mês: ', month)
    result = correlacao2.loc[month]
    result = pd.DataFrame(result)
    print(result)
Here I used corr() with a for loop and converted each result to a DataFrame; dados is the name of your DataFrame.
If your column names are numbers, rename them to month names with dados.rename(columns={'1': 'Jan', '2': 'Feb', '3': 'Mar'}), adding the remaining months in the same way. After renaming, apply the code above to get the expected result.
If you don't want to rename, use .iloc[] instead of .loc[] in the code above.
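If, as in the question's own code, the months are values in a `month` column rather than column names, grouping by that column is one way to repeat the per-month correlation for every month. The sketch below uses made-up data; the columns `a` and `b` are hypothetical stand-ins for the real numeric columns.

```python
import pandas as pd

# Hypothetical data: "a" and "b" stand in for the real measurement columns.
dados = pd.DataFrame({
    "month": [1, 1, 1, 2, 2, 2],
    "a": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "b": [2.0, 4.0, 6.0, 3.0, 2.0, 1.0],
})

correlations = {}
for month, group in dados.groupby("month"):
    # Drop the constant "month" column, then keep the first row of the
    # correlation matrix, as the question does with .iloc[0].
    correlations[month] = group.drop(columns="month").corr().round(4).iloc[0]

for month, row in correlations.items():
    print("month", month)
    print(row)
```

Each entry of `correlations` is a Series holding the correlations of the first column with every other column for that month.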

How to find the first and last of one of several characters in a list of strings?

So I have a list of strings such as this:
my_list=["---abcdefgh----abc--","--abcd-a--","----------abcdefghij----ab-","-abcdef---a-","----abcdefghijklm----abc--"]
I want, for each string, to retrieve the position where the first and last letters appear. Or in other words, to find the position of the first character that isn't a "-" and the position of the last character that isn't a "-". It would be perfect if I could save the result as two lists, one of the first positions and another for the last.
I've tried using find() at least for the first position but since the character I'm trying to find is one of several letters, I don't know how to do it.
The output I wanted was something like this:
first_positions=[3,2,10,1,4]
last_positions=[17,7,25,10,23]
Thanks in advance for any answer
Here is an implementation without using regex.
my_list=["---abcdefgh----abc--","--abcd-a--","----------abcdefghij----ab-","-abcdef---a-","----abcdefghijklm----abc--"]
def find_i(word):
    first = None
    last = None
    for i, letter in enumerate(word):
        if first is None:
            if letter != '-':
                first = i
        else:
            if letter != '-':
                last = i
    return (first, last)
r = list(map(find_i, my_list))
print(r) #I like this output more, but it is up to you.
first_positions = [i[0] for i in r]
last_positions = [i[1] for i in r]
print(first_positions)
print(last_positions)
Output:
[(3, 17), (2, 7), (10, 25), (1, 10), (4, 23)]
[3, 2, 10, 1, 4]
[17, 7, 25, 10, 23]
There is possibly a nicer way to do this, but one way is to use two regex searches. For the first position, match a run of non-hyphen characters and take the start index of that match. For the last position, match a non-hyphen character that is followed by zero or more hyphens and then the end of the line, and take the start index of that match. Then compile the results into lists:
>>> import re
>>> [re.search(r'[^-]+', string).start() for string in my_list]
[3, 2, 10, 1, 4]
>>> [re.search(r'[^-]-*$', string).start() for string in my_list]
[17, 7, 25, 10, 23]
Check whether each character is alphabetic and, if so, add its index to a list. By default (find='first') the function returns the first such index for the string; otherwise it returns the last one.
my_list=["---abcdefgh----abc--","--abcd-a--","----------abcdefghij----ab-","-abcdef---a-","----abcdefghijklm----abc--"]
def get_index(string,find='first'):
    index_lt=[]
    for idx,char in enumerate(string):
        if char.isalpha():
            index_lt.append(idx)
    return index_lt[0] if find=='first' else index_lt[-1]
print([get_index(string) for string in my_list])
#[3, 2, 10, 1, 4]
print([get_index(string,find='last') for string in my_list])
#[17, 7, 25, 10, 23]
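Another option, sketched here under the assumption that every string contains at least one character other than "-", is to compare lengths before and after stripping the hyphens. No explicit character loop or regex is needed.

```python
my_list = [
    "---abcdefgh----abc--",
    "--abcd-a--",
    "----------abcdefghij----ab-",
    "-abcdef---a-",
    "----abcdefghijklm----abc--",
]

# Number of leading hyphens equals the index of the first other character.
first_positions = [len(s) - len(s.lstrip("-")) for s in my_list]

# Length after removing trailing hyphens, minus one, is the last such index.
last_positions = [len(s.rstrip("-")) - 1 for s in my_list]

print(first_positions)
print(last_positions)
```

For a string made up only of hyphens this would give a first position equal to the string length and a last position of -1, so such inputs would need separate handling.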

convert a dataframe column from string to List of numbers

I have created the following dataframe from a csv file:
id marks
5155 1,2,3,,,,,,,,
2156 8,12,34,10,4,3,2,5,0,9
3557 9,,,,,,,,,,
7886 0,7,56,4,34,3,22,4,,,
3689 2,8,,,,,,,,
It is indexed on id. The values in the marks column are strings. I need to convert them to lists of numbers so that I can iterate over them and use them as index numbers for another dataframe. How can I convert them from a string to a list? I tried to add a new column and convert them based on "Add a columns in DataFrame based on other column", but it failed:
df = df.assign(new_col_arr=lambda x: np.fromstring(x['marks'].values[0], sep=',').astype(int))
Here's a way to do it:
df = df.assign(new_col_arr=df['marks'].str.split(','))
# convert to int
df['new_col'] = df['new_col_arr'].apply(lambda x: list(map(int, [i for i in x if i != ''])))
I presume that you want to create a NEW DataFrame, since the number of items is different from the number of rows. I suggest the following:
# source data
df = pd.DataFrame({'id':[5155, 2156, 7886],
                   'marks':['1,2,3,,,,,,,,','8,12,34,10,4,3,2,5,0,9', '0,7,56,4,34,3,22,4,,,']})
# create dictionary from df:
dd = {row[0]:np.fromstring(row[1], dtype=int, sep=',') for _, row in df.iterrows()}
{5155: array([1, 2, 3]),
2156: array([ 8, 12, 34, 10, 4, 3, 2, 5, 0, 9]),
7886: array([ 0, 7, 56, 4, 34, 3, 22, 4])}
# here you pad the lists inside dictionary so that they have equal length
...
# convert dd to DataFrame:
df2 = pd.DataFrame(dd)
I found two similar alternatives:
1.
df['marks'] = df['marks'].str.split(',').map(lambda num_str_list: [int(num_str) for num_str in num_str_list if num_str])
2.
df['marks'] = df['marks'].map(lambda arr_str: [int(num_str) for num_str in arr_str.split(',') if num_str])
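The second alternative can be tried on a few of the sample rows from the question. This is a sketch only: the new column name `marks_list` and the helper `parse_marks` are made up for illustration, and the frame is built by hand rather than read from the CSV file.

```python
import pandas as pd

# A few rows from the question, indexed on id as described there.
df = pd.DataFrame(
    {"marks": ["1,2,3,,,,,,,,", "8,12,34,10,4,3,2,5,0,9", "9,,,,,,,,,,"]},
    index=pd.Index([5155, 2156, 3557], name="id"),
)


def parse_marks(text):
    """Split on commas and convert the non-empty pieces to int."""
    return [int(part) for part in text.split(",") if part]


# Keep the original strings and store the parsed lists in a new column.
df["marks_list"] = df["marks"].map(parse_marks)

marks_by_id = df["marks_list"].to_dict()
print(marks_by_id)
```

Filtering with `if part` is what drops the empty strings produced by the trailing commas.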

How to use np.where() to create a new array of specific rows?

I have an array (msaarr) of 1700 values, ranging from approximately 0 to 150. I know that 894 of these values should be less than 2, and I wish to create a new array containing only these values.
So far, I have attempted this code:
Combined = np.zeros(shape=(894,8))
for i in range(len(Spitzer)): #len(Spitzer) = 1700
    index = np.where(msaarr <= 2)
    Combined[:,0] = msaarr[index]
The reason there are eight columns is because I have more data associated with each value in msaarr that I also want to display. msaarr was created using several lines of code, which is why I haven't mentioned them here, but it is an array with shape (1700,1) with type float64.
The problem I'm having is that if I print msaarr[index], then I get an array of shape (893,), but when I attempt to assign this as my zeroth column, I get the error
ValueError: could not broadcast input array from shape (1699) into shape (894)
I also attempted
Combined[:,0] = np.extract(msaarr <= 2, msaarr)
Which gave the same error.
I thought at first this might just be some confusion with Python's zero-indexing, so I tried changing the shape to 893, and also tried to assign to a different column Combined[:,1], but I have the same error every time.
Alternatively, when I try:
Combined[:,1][i] = msaarr[index][i]
I get the error:
IndexError: index 894 is out of bounds for axis 0 with size 894
What am I doing wrong?
EDIT:
A friend pointed out that I might not be calling index correctly because it is a tuple, and so his suggestion was this:
index = np.where(msaarr < 2)
Combined[:,0] = msaarr[index[0][:]]
But I am still getting this error:
ValueError: could not broadcast input array from shape (893,1) into shape (893)
How can my shape be (893) and not (893, 1)?
Also, I did check, and len(index[0][:]) = 893, and len(msaarr[index[0][:]]) = 893.
The full code as of last attempts is:
import numpy as np
from astropy.io import ascii
from astropy.io import fits
targets = fits.getdata('/Users/vcolt/Dropbox/ATLAS source matches/OzDES.fits')
Spitzer = ascii.read(r'/Users/vcolt/Desktop/Catalogue/cdfs_spitzer.csv', header_start=0, data_start=1)
## Find minimum separations, indexed.
RADiffArr = np.zeros(shape=(len(Spitzer),1))
DecDiffArr = np.zeros(shape=(len(Spitzer),1))
msaarr = np.zeros(shape=(len(Spitzer),1))
Combined= np.zeros(shape=(893,8))
for i in range(len(Spitzer)):
    x = Spitzer["RA_IR"][i]
    y = Spitzer["DEC_IR"][i]
    sep = abs(np.sqrt(((x - targets["RA"])*np.cos(np.array(y)))**2 + (y - targets["DEC"])**2))
    minsep = np.nanmin(sep)
    minseparc = minsep*3600
    msaarr[i] = minseparc
    min_positions = [j for j, p in enumerate(sep) if p == minsep]
    x2 = targets["RA"][min_positions][0]
    RADiff = x*3600 - x2*3600
    RADiffArr[i] = RADiff
    y2 = targets["DEC"][min_positions][0]
    DecDiff = y*3600 - y2*3600
    DecDiffArr[i] = DecDiff
index = np.where(msaarr < 2)
print msaarr[index].shape
Combined[:,0] = msaarr[index[0][:]]
I get the same error whether index = np.where(msaarr < 2) is in or out of the loop.
Take a look at using numpy.take in combination with numpy.where. Since np.where returns a tuple of index arrays, pass its first element (the row indices):
inds = np.where(msaarr <= 2)
new_msaarr = np.take(msaarr, inds[0])
If it is a multi-dimensional array, you can also add the axis keyword to take slices along that axis.
I think the loop is not in the right place. np.where() returns the indices of the elements that match the condition you specified, so it does not need to be called on every iteration.
This should suffice:
Index = np.where(msaarr <= 2)
Since Index holds the matching indices, you can loop over them and fill the values into Combined[:, 0].
One more thing I want to point out: you said there will be 894 values less than 2, but in the code you are using less than or equal to 2.
np.where(condition) returns a tuple of arrays containing the indices of the elements that satisfy your condition.
To get an array of the elements satisfying your condition, use boolean indexing:
new_array = msaarr[msaarr <= 2]
For example:
>>> x = np.random.randint(0, 10, (4, 4))
>>> x
array([[1, 6, 8, 4],
[0, 6, 6, 5],
[9, 6, 4, 4],
[9, 6, 8, 6]])
>>> x[x>2]
array([6, 8, 4, 6, 6, 5, 9, 6, 4, 4, 9, 6, 8, 6])
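The (893, 1) versus (893) mismatch in the question appears to come from msaarr being a column vector of shape (N, 1). The sketch below uses made-up values and lower-case names (`combined`, `selected`) of my own choosing to show one way of getting a 1-D selection that fits into a single column.

```python
import numpy as np

# Stand-in for the question's msaarr: a column vector of shape (6, 1).
msaarr = np.array([[0.5], [3.0], [1.2], [150.0], [1.9], [2.0]])

# Build a 1-D boolean mask from the single column, then select with it.
mask = msaarr[:, 0] < 2            # shape (6,)
selected = msaarr[mask, 0]         # shape (3,)

# Size the output from the data instead of hard-coding the row count.
combined = np.zeros((selected.size, 8))
combined[:, 0] = selected

# Indexing with only the row indices from np.where keeps the trailing axis,
# which is the (n, 1) shape that cannot be assigned to a 1-D column.
rows = np.where(msaarr < 2)[0]
kept_2d = msaarr[rows]             # shape (3, 1)

print(selected.shape, kept_2d.shape, combined.shape)
```

The same mask could be reused to fill the remaining columns from the other arrays (for example `RADiffArr[mask, 0]`), since they share msaarr's row order.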
