Can I split a DataFrame by columns? - python

I need to split a DataFrame by columns.
I wrote some simple code that runs without error but doesn't give me the result I expected.
Here's the code:
dados = pd.read_excel(r'XXX')
for x in range(1,13):
    selectmonth = x
    while selectmonth < 13:
        df_datas = dados.loc[dados['month'] == selectmonth]
        correlacao2 = df_datas.corr().round(4).iloc[0]
    else: break
    print()
I did it one month at a time, entering the selected month manually, like this:
dfdatas = dados.loc[dados['month'] == selectmonth]
print('\n Voce selecionou o mês: ', selectmonth)
colunas2 = list(dfdatas.columns.values)
correlacao2 = dfdatas.corr().round(4).iloc[0]
print(correlacao2)
Is there some way to do this in a loop, from month 1 to 12?

With pandas, you should avoid explicit loops wherever possible, because they are very slow. You can achieve what you want here with index slicing. Assuming your columns are just the month numbers, you can do this:
setting up an example df:
df = pd.DataFrame([], columns=range(15))
df:
Columns: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
Index: []
getting columns with numbers 1 to 12:
dfdatas = df.loc[:, 1:12]
Columns: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
Index: []
In the future, you should include example data in your question.

Just try this:
correlacao2 = dados.corr(method='pearson').round(4)
for month in dados.columns:
    print('\n Voce selecionou o mês: ', month)
    result=correlacao2.loc[month]
    result=pd.DataFrame(result)
    print(result)
Here I compute the correlation matrix once with corr(), then loop over the columns and convert each row to a DataFrame.
dados is the name of your DataFrame.
If your column names are numbers, you can rename them to month names with dados.rename(columns={'1': 'Jan', '2': 'Feb', '3': 'Mar'}), adding the remaining months in the same way (rename returns a new DataFrame, so assign the result back to dados). After renaming, apply the code above to get the expected result.
If you don't want to rename them, use .iloc[] instead of .loc[] in the code above.
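For the loop the question asks about (one correlation row per month), iterating over groupby('month') is one possible sketch. It assumes dados has a numeric 'month' column as in the question; the columns a and b and the random data below are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Excel file: a 'month' column plus two numeric columns.
rng = np.random.default_rng(0)
dados = pd.DataFrame({
    "month": np.repeat(np.arange(1, 13), 5),
    "a": rng.normal(size=60),
    "b": rng.normal(size=60),
})

correlations = {}
for selectmonth, df_month in dados.groupby("month"):
    # 'month' is constant inside each group, so drop it before correlating,
    # then keep the first row of the correlation matrix as in the question.
    correlations[selectmonth] = df_month.drop(columns="month").corr().round(4).iloc[0]

# One row per month, one column per correlated variable.
result = pd.DataFrame(correlations).T
print(result)
```

Each row of result holds what the manual version printed for a single selectmonth.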

Related

How Can I Solve This No Duplicated 2 Column Calculation?

Hello StackOverflow people! I have a problem here. I did some research but still can't solve it. I have two columns taken from a dataset: "# Externo" and "Nro Envio ML".
I want the code to return only the numbers that exist in "# Externo" but not in "Nro Envio ML".
For example:
If 41765931626 appears only in the "# Externo" column and not in "Nro Envio ML", I want to print that number. If every number in "# Externo" also appears in "Nro Envio ML", I want to print some text instead: print("No strange sales")
Here is the code I tried. Sorry for my bad English.
import numpy as np
df2=df2.dropna(subset=['Unnamed: 13'])
df2 = df2[df2['Unnamed: 13'] != 'Nro. Envío']
df2['Nro Envio ML']=df2['Unnamed: 13']
dfn=df2[["# Externo","Nro Envio ML"]]
dfn1 = dfn[dfn['# Externo'] != dfn['Nro Envio ML']]
dfn1
I also tried diff, but it gives me values that are in 'Nro Envio ML'.
Link for Sample:
https://github.com/francoveracallorda/sample
I would go outside of pandas and use Python's built-in set type to compute the difference. Here is a simplified example:
import pandas as pd
df = pd.DataFrame({
"# Externo": [3, 5, 4, 2, 1, 7, 8],
"Nro Envio ML": [4, 9, 0, 2, 1, 3, 5]
})
diff = set(df["# Externo"]) - set(df["Nro Envio ML"])
# diff contains the values that are in df["# Externo"] but not in df["Nro Envio ML"].
print(f"Weird sales: {diff}" if diff else "No strange sales")
# Output:
# Weird sales: {8, 7}
PS: If you want to stay inside pandas, you can use diff = df.loc[~df["# Externo"].isin(df["Nro Envio ML"]), "# Externo"] to compute the same difference as a pd.Series.
You can use ~ together with pandas' isin.
series1 = pd.Series([2, 4, 8, 20, 10, 47, 99])
series2= pd.Series([1, 3, 6, 4, 10, 99, 50])
series3 = pd.Series([2, 4, 8, 20, 10, 47, 99])
df = pd.concat([series1, series2,series3], axis=1)
Case 1: Number in series1 but not in series2
diff = series1[~series1.isin(series2)]
Case 2: Every number in series1 is also in series3, so the result is empty
same = series1[~series1.isin(series3)]
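Putting the isin approach together with the message the question asks for could look like the sketch below. It reuses the simplified data from the set-based answer above; the names only_externo and message are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "# Externo": [3, 5, 4, 2, 1, 7, 8],
    "Nro Envio ML": [4, 9, 0, 2, 1, 3, 5],
})

# Keep the values of "# Externo" that never appear in "Nro Envio ML".
only_externo = df.loc[~df["# Externo"].isin(df["Nro Envio ML"]), "# Externo"]

if only_externo.empty:
    message = "No strange sales"
else:
    message = f"Weird sales: {only_externo.tolist()}"
print(message)
# Weird sales: [7, 8]
```

Unlike the set version, this keeps the original row order and any repeated values.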

convert a dataframe column from string to List of numbers

I have created the following dataframe from a csv file:
id marks
5155 1,2,3,,,,,,,,
2156 8,12,34,10,4,3,2,5,0,9
3557 9,,,,,,,,,,
7886 0,7,56,4,34,3,22,4,,,
3689 2,8,,,,,,,,
It is indexed on id. The values in the marks column are strings. I need to convert them to lists of numbers so that I can iterate over them and use them as index numbers for another dataframe. How can I convert them from strings to lists? I tried to add a new column and convert them based on "Add a columns in DataFrame based on other column", but it failed:
df = df.assign(new_col_arr=lambda x: np.fromstring(x['marks'].values[0], sep=',').astype(int))
Here's a way to do it:
df = df.assign(new_col_arr=df['marks'].str.split(','))
# convert to int
df['new_col'] = df['new_col_arr'].apply(lambda x: list(map(int, [i for i in x if i != ''])))
I presume that you want to create a NEW DataFrame, since the number of items is different from the number of rows. I suggest the following:
# source data
df = pd.DataFrame({'id':[5155, 2156, 7886],
                   'marks':['1,2,3,,,,,,,,','8,12,34,10,4,3,2,5,0,9', '0,7,56,4,34,3,22,4,,,']})
# create dictionary from df:
dd = {row['id']: np.fromstring(row['marks'], dtype=int, sep=',') for _, row in df.iterrows()}
{5155: array([1, 2, 3]),
2156: array([ 8, 12, 34, 10, 4, 3, 2, 5, 0, 9]),
7886: array([ 0, 7, 56, 4, 34, 3, 22, 4])}
# here you pad the lists inside dictionary so that they have equal length
...
# convert dd to DataFrame:
df2 = pd.DataFrame(dd)
I found two similar alternatives:
1.
df['marks'] = df['marks'].str.split(',').map(lambda num_str_list: [int(num_str) for num_str in num_str_list if num_str])
2.
df['marks'] = df['marks'].map(lambda arr_str: [int(num_str) for num_str in arr_str.split(',') if num_str])
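Another option, not from the answers above, is to go through a long format with Series.explode, which may be convenient if the numbers are later used to index another dataframe. This is only a sketch on three of the sample rows; the names long_marks and marks_by_id are made up here.

```python
import pandas as pd

df = pd.DataFrame(
    {"marks": ["1,2,3,,,,,,,,", "8,12,34,10,4,3,2,5,0,9", "9,,,,,,,,,,"]},
    index=pd.Index([5155, 2156, 3557], name="id"),
)

# One row per comma-separated piece; the id index is repeated for each piece.
long_marks = df["marks"].str.split(",").explode()

# Drop the empty pieces left by trailing commas, then convert to integers.
long_marks = long_marks[long_marks != ""].astype(int)

# Collapse back to one list of ints per id if lists are what you need.
marks_by_id = long_marks.groupby(level="id").agg(list)
print(marks_by_id)
```

Note that groupby sorts the ids, so marks_by_id is ordered by id and not by the original row order.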

csv.writer - How to dynamically write columns in writerow(n1,n2,n3, ...nth)?

I am trying to write to a CSV file. I want to write three variables on a row and then write a variable number of columns.
So for example my script will do a bunch of calculations and come up with the idea that I need 12 columns.
So the 'variable' needs to contain column 0 thru 11.
How to do this dynamically?
numberofcolumns = 12
with open(f+".csv",'wb') as output_csvfile:
    filewriter = csv.writer(output_csvfile)
    filewriter.writerow([constant1,constant2,constant3,variable[0],...,variable[n]])
What I want is to do
filewriter.writerow([constant1, constant2, constant3, variable[0], variable[1],....,variable[11]])
However, the last index may not be 11; it may be 8 or 10 or anything else, because the length is dynamic. How can I make this code write up to the Nth column if the function writerow() isn't defined to use *args?
What martineau pointed out in a comment is correct. writerow accepts a list, or sequence, of any length.
So you could do something like the following:
variable = list(range(12))  # list() is needed on Python 3, where range is not a list
# Change your writerow line to be something like this:
filewriter.writerow([constant1,constant2,constant3] + variable)
range is used here only as an example of creating a list of however many items.
Notice that the above example uses + to put two sequences/lists together.
Here's an example of that from the command line/repl:
>>> variable = list(range(12))
>>> variable
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
>>> ["x", "y", "z"] + variable
['x', 'y', 'z', 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
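For Python 3, a self-contained version of the same idea might look like this. It is a sketch: the constant values are placeholders, and it writes to an in-memory io.StringIO buffer so that it runs without touching the disk (with a real file you would open it in text mode with newline='').

```python
import csv
import io

constant1, constant2, constant3 = "a", "b", "c"  # placeholder fixed values
variable = list(range(8))                        # the length can differ from run to run

buffer = io.StringIO()
filewriter = csv.writer(buffer)

# writerow takes one sequence of any length, so just build the row dynamically.
filewriter.writerow([constant1, constant2, constant3, *variable])

line = buffer.getvalue()
print(line)
```

The *variable unpacking inside the list literal does the same job as the + concatenation above.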

DataFrame assign inside function

I have a question regarding the DataFrame assign function. When using it, I must give the column name without quotes. Why is this, and can I circumvent it? See the example below.
df = pd.DataFrame(columns=['Grade'])
df['Grade'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
df_temp = df.assign(Grade='Total')
def dummy(df, _g):
    # if I write Grade here then I get the expected result
    return df.assign(_g='Total')
# here I want Grade to be assigned 'Total', but it creates a new column called _g
df_temp = dummy(df, 'Grade')
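assign takes the column name as a keyword argument, so _g='Total' creates a column literally named _g. To pass the name from a variable, unpack a dictionary into keyword arguments:
def dummy(df, _g):
    return df.assign(**{_g: 'Total'})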
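A small runnable check of the dictionary-unpacking approach, with a shortened Grade column; the helper name set_column is made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({"Grade": [1, 2, 3]})

def set_column(frame, column_name, value):
    # **{column_name: value} turns the string held in column_name into the
    # keyword argument name, e.g. assign(Grade='Total').
    return frame.assign(**{column_name: value})

df_temp = set_column(df, "Grade", "Total")
print(df_temp)
```

assign returns a new DataFrame, so df itself keeps its original numbers.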

Counting like elements in a list and appending list

I am trying to create a list in Python with values pulled from an active excel sheet. I want it to pull the step # value from the excel file and append it to the list while also including which number of that element it is. For example, 1_1 the first time it pulls 1, 1_2 the second time, 1_3 the third, etc. My code is as follows...
import win32com.client
xl = win32com.client.Dispatch("Excel.Application")
CellNum = xl.ActiveSheet.UsedRange.Rows.Count
Steps = []
for i in range(2,CellNum + 1): #Create load and step arrays in abaqus after importing from excel
    if str(int(xl.Cells(i,1).value))+('_1' or '_2' or '_3' or '_4' or '_5' or '_6') in Steps:
        StepCount = 1
        for x in Steps:
            if x == str(int(xl.Cells(i,1).value))+('_1' or '_2' or '_3' or '_4' or '_5' or '_6'):
                StepCount+=1
        Steps.append(str(int(xl.Cells(i,1).value))+'_'+str(StepCount))
    else:
        Steps.append(str(int(xl.Cells(i,1).value))+'_1')
I understand that without the excel file, the program will not run for any of you, but I was just wondering if it is some simple error that I am missing. When I run this, the StepCount does not go higher than 2 so I receive a bunch of 1_2, 2_2, 3_2, etc elements. I've posted my resulting list below.
>>> Steps
['1_1', '2_1', '3_1', '4_1', '5_1', '6_1', '7_1', '8_1', '9_1', '10_1', '11_1', '12_1',
'13_1', '14_1', '1_2', '14_2', '13_2', '12_2', '11_2', '10_2', '2_2', '3_2', '9_2',
'8_2', '7_2', '6_2', '5_2', '4_2', '3_2', '2_2', '1_2', '2_2', '3_2', '4_2', '5_2',
'6_2', '7_2', '8_2', '9_2', '10_2', '11_2', '12_2', '13_2', '14_2', '1_2', '2_2']
EDIT #1: So, if the ('_1' or '_2' or '_3' or '_4' or '_5' or '_6') will ALWAYS only use _1, is it this line of code that is messing with my counter?
if x == str(int(xl.Cells(i,1).value))+('_1' or '_2' or '_3' or '_4' or '_5' or '_6'):
Since it is only using _1, it will only count 1_1 and not check 1_2, 1_3, 1_4, etc
EDIT #2: Now I am using the following code. My input list is also below.
from collections import defaultdict
StepsList = []
Steps = []
tracker = defaultdict(int)
for i in range(2,CellNum + 1):
    StepsList.append(int(xl.Cells(i,1).value))
>>> StepsList
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1, 14, 13, 12, 11, 10, 2, 3, 9, 8,
7, 6, 5, 4, 3, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1, 2]
for cell in StepsList:
    Steps.append('{}_{}'.format(cell, tracker[cell]+1)) # This is +1 because the tracker starts at 0
    tracker[cell]+=1
I get the following error: ValueError: zero length field name in format from the for cell in StepsList: iteration block
EDIT #3: Got it working. For some reason it didn't like
Steps.append('{}_{}'.format(cell, tracker[cell]+1))
So I just changed it to
for cell in StepsList:
    tracker[cell]+=1
    Steps.append(str(cell)+'_'+str(tracker[cell]))
Thanks for all of your help!
This line:
if str(int(xl.Cells(i,1).value))+('_1' or '_2' or '_3' or '_4' or '_5' or '_6') in Steps:
does not do what you think it does. ('_1' or '_2' or '_3' or '_4' or '_5' or '_6') always evaluates to '_1', because or returns its first truthy operand. It does not iterate over that series of or values looking for a match.
Without seeing expected input vs. expected output, it's hard to point you in the correct direction to actually get what you want out of your code, but likely you'll want to leverage itertools.product or one of the other combinatoric methods from itertools.
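To make the or behaviour concrete, here is a short sketch with a made-up steps list and step number. It also shows two ways to actually test several suffixes, which is what the original line seems to intend.

```python
# `or` returns its first truthy operand, so the whole expression is just '_1'.
suffix = ('_1' or '_2' or '_3')
print(suffix)

steps = ['1_1', '1_2', '2_1']   # illustrative data
step_number = '1'               # illustrative step value

# Test every suffix explicitly instead of relying on `or`.
already_seen = any(step_number + s in steps for s in ('_1', '_2', '_3'))

# Or count the existing entries for this step by their prefix.
count = sum(1 for s in steps if s.startswith(step_number + '_'))
print(already_seen, count)
```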
Update
Based on your comments, I think that this is a way of solving your problem. Assuming an input list of the following:
in_list = [1, 1, 1, 2, 3, 3, 4]
You can do the following:
from collections import defaultdict
tracker = defaultdict(int) # defaultdict is just a regular dict with a default value at new keys (in this case 0)
steps = []
for cell in in_list:
    steps.append('{}_{}'.format(cell, tracker[cell]+1)) # This is +1 because the tracker starts at 0
    tracker[cell]+=1
Result:
>>> steps
['1_1', '1_2', '1_3', '2_1', '3_1', '3_2', '4_1']
There are likely more efficient ways to do this using combinations of itertools, but this way is certainly the most straightforward.
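About the ValueError: zero length field name in format reported in the question's edit: Python 2.6 does not support auto-numbered '{}' fields in str.format, so that message points to an old interpreter (this is an inference from the error text, since the question does not state the version). Numbering the fields explicitly works on old and new versions alike. A sketch, with label_occurrences as a made-up helper name:

```python
from collections import defaultdict

def label_occurrences(values):
    """Return 'value_n' labels, where n counts how often value has appeared so far."""
    tracker = defaultdict(int)
    labels = []
    for value in values:
        tracker[value] += 1
        # Explicit field indices ({0}, {1}) avoid the zero-length-field error.
        labels.append('{0}_{1}'.format(value, tracker[value]))
    return labels

steps = label_occurrences([1, 1, 1, 2, 3, 3, 4])
print(steps)
# ['1_1', '1_2', '1_3', '2_1', '3_1', '3_2', '4_1']
```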
