How Can I Solve This No Duplicated 2 Column Calculation? - python

Hello StackOverflow people! I have run into a problem that I can't solve despite some research. I have two columns extracted from a dataset: "# Externo" and "Nro Envio ML".
I want the code to return only the numbers that exist in "# Externo" but not in "Nro Envio ML".
For example:
If 41765931626 appears only in the "# Externo" column and not in "Nro Envio ML", I want to print that number. If every number in "# Externo" also appears in "Nro Envio ML", I want to print some text instead: print("No strange sales")
Here is the code I tried. Sorry for my bad English.
import numpy as np
df2=df2.dropna(subset=['Unnamed: 13'])
df2 = df2[df2['Unnamed: 13'] != 'Nro. Envío']
df2['Nro Envio ML']=df2['Unnamed: 13']
dfn=df2[["# Externo","Nro Envio ML"]]
dfn1 = dfn[dfn['# Externo'] != dfn['Nro Envio ML']]
dfn1
I also tried diff, but it gives me values that are in 'Nro Envio ML'.
Link for Sample:
https://github.com/francoveracallorda/sample

I would go outside of pandas and use Python's built-in set to compute the difference. Here is a simplified example:
import pandas as pd
df = pd.DataFrame({
"# Externo": [3, 5, 4, 2, 1, 7, 8],
"Nro Envio ML": [4, 9, 0, 2, 1, 3, 5]
})
diff = set(df["# Externo"]) - set(df["Nro Envio ML"])
# diff contains the values that are in df["# Externo"] but not in df["Nro Envio ML"].
print(f"Weird sales: {diff}" if diff else "No strange sales")
# Output:
# Weird sales: {8, 7}
PS: If you want to stay inside pandas, you can use diff = df.loc[~df["# Externo"].isin(df["Nro Envio ML"]), "# Externo"] to compute the same difference as a pd.Series.

You can use ~ together with pandas' isin.
import pandas as pd
series1 = pd.Series([2, 4, 8, 20, 10, 47, 99])
series2 = pd.Series([1, 3, 6, 4, 10, 99, 50])
series3 = pd.Series([2, 4, 8, 20, 10, 47, 99])
df = pd.concat([series1, series2, series3], axis=1)
Case 1: some numbers are in series1 but not in series2
diff = series1[~series1.isin(series2)]  # keeps 2, 8, 20, 47
Case 2: every number in series1 is also in series3, so the result is empty
same = series1[~series1.isin(series3)]
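The isin approach can be combined with the fallback message the question asks for. The following is only a sketch on the simplified data from the first answer; the helper name strange_sales is hypothetical and not part of pandas.

```python
import pandas as pd

# Simplified sample data, not the asker's real dataset.
df = pd.DataFrame({
    "# Externo": [3, 5, 4, 2, 1, 7, 8],
    "Nro Envio ML": [4, 9, 0, 2, 1, 3, 5],
})

def strange_sales(frame, left="# Externo", right="Nro Envio ML"):
    """Return the values of `left` that never appear in `right` (hypothetical helper)."""
    mask = ~frame[left].isin(frame[right])
    return frame.loc[mask, left].drop_duplicates().tolist()

missing = strange_sales(df)
# An empty list is falsy, so the fallback text is printed when nothing is missing.
print(missing if missing else "No strange sales")
```

Unlike the set-based version, this keeps the values in their original row order.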

Related

Applying an iterable mask, checking it against a value - if value doesn't satisfy the mask condition, move to the next value which does

I currently have some code where I've created a mask which checks to see if a variable matches the first position in a sequence, called index_pos_overload. If it matches, the variable is chosen, and the check ends. However, I want to be able to use this mask to not only check if the number satisfies the condition of the mask, but if it doesn't move along to the next value in the sequence which does. It's essentially to pick out a row in my pandas data column, hyst. My code currently looks like this:
import pandas as pd
from itertools import chain
hyst = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9, 7, 5, 3, 6, 3, 2 ,1, 5, 2]})
possible_overload_cycle = 1
index_pos_overload = chain.from_iterable((hyst.index[i])
                                         for i in range(0, len(hyst)-1, 5))
if (possible_overload_cycle == index_pos_overload):
    hyst_overload_cycle = possible_overload_cycle
else:
    hyst_overload_cycle = 5 #next value in iterable where index_pos_overload is true
The expected output of hyst_overload_cycle should be this:
print(hyst_overload_cycle)
5
I've included my logic as to how I think this should work: possible_overload_cycle = 1 does not point to the first position in the dataframe, so hyst_overload_cycle should return 5, the first position in the mask. I hope this makes sense; I can't quite work out how to do this programmatically.
If I understood you correctly, it may be simpler than you think:
- index_pos_overload can be an array or a list; there is no need for complex constructs to store a sequence of values.
- To find the first non-zero value in index_pos_overload, you can use np.nonzero(index_pos_overload)[0][0] (the first [0] selects the dimension, the second selects the index within that axis) and use the result to index the original index_pos_overload array.
The code would look like:
import numpy as np
import pandas as pd
hyst = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9, 7, 5, 3, 6, 3, 2 ,1, 5, 2]})
possible_overload_cycle = 1
index_pos_overload = np.array([hyst.index[i] for i in range(0, len(hyst)-1, 5)])
if possible_overload_cycle in index_pos_overload:
    hyst_overload_cycle = possible_overload_cycle
else:
    hyst_overload_cycle = index_pos_overload[np.nonzero(index_pos_overload)[0][0]]
print(hyst_overload_cycle)
# 5
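If "move along to the next value in the sequence" is read as "take the first mask position at or after the candidate", a sorted search is one possible way to express it. This is a sketch under that assumed interpretation, not the answer's method; next_mask_position is a hypothetical helper name.

```python
import numpy as np

# Mask positions as in the answer: every 5th index of a 19-row frame.
positions = np.arange(0, 18, 5)  # array([ 0,  5, 10, 15])

def next_mask_position(candidate, positions):
    """Return the first position >= candidate, or None if there is none.

    Assumes `positions` is sorted in ascending order.
    """
    idx = np.searchsorted(positions, candidate, side="left")
    if idx == len(positions):
        return None
    return int(positions[idx])

hyst_overload_cycle = next_mask_position(1, positions)
print(hyst_overload_cycle)  # 5
```

A candidate that is already a mask position (for example 10) is returned unchanged, which matches the "if it matches, the variable is chosen" behaviour described in the question.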

Can I split a DataFrame by columns?

I need to split a DataFrame by its columns. I wrote some simple code that runs without error but doesn't return what I expected.
Here is the code:
dados = pd.read_excel(r'XXX')
for x in range(1,13):
    selectmonth = x
    while selectmonth < 13:
        df_datas = dados.loc[dados['month'] == selectmonth]
        correlacao2 = df_datas.corr().round(4).iloc[0]
    else: break
    print()
I did it one month at a time by entering the selected month manually, like this:
dfdatas = dados.loc[dados['month'] == selectmonth]
print('\n Voce selecionou o mês: ', selectmonth)
colunas2 = list(dfdatas.columns.values)
correlacao2 = dfdatas.corr().round(4).iloc[0]
print(correlacao2)
Is there some way to do this in a loop, from month 1 to 12?
With pandas, you should avoid loops wherever possible because they are very slow. You can achieve what you want here with index slicing. Assuming your columns are just the month numbers, you can do this:
setting up an example df:
df = pd.DataFrame([], columns=range(15))
df:
Columns: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
Index: []
getting columns with numbers 1 to 12:
dfdatas = df.loc[:, 1:12]
Columns: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
Index: []
In the future, you should include example data in your question.
Just try this:
correlacao2 = dados.corr(method='pearson').round(4)
for month in dados.columns:
    print('\n Voce selecionou o mês: ', month)
    result=correlacao2.loc[month]
    result=pd.DataFrame(result)
    print(result)
Here I compute the correlation matrix with corr(), loop over the columns with a for loop, and convert each row of the matrix to a DataFrame.
dados is the name of your DataFrame.
If your column names are numbers, rename them to month names using dados.rename(columns={'1': 'Jan','2':'Feb','3':'Mar'}), adding the other months in the same way. After renaming, apply the code above to get the expected result.
If you don't want to rename the columns, use .iloc[] instead of .loc[] in the code above.
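If the data instead has a 'month' column, as in the question's own snippet, the manual per-month step can be repeated with groupby. This is a minimal sketch on made-up data; the columns a, b and c are placeholders, since the real spreadsheet is not shown.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Excel file: 5 rows per month, three numeric columns.
rng = np.random.default_rng(0)
dados = pd.DataFrame({
    "month": np.repeat(np.arange(1, 13), 5),
    "a": rng.normal(size=60),
    "b": rng.normal(size=60),
    "c": rng.normal(size=60),
})

correlations = {}
for month, group in dados.groupby("month"):
    # Drop the constant 'month' column so it does not produce NaN correlations.
    corr = group.drop(columns="month").corr().round(4)
    # First row of the matrix: correlation of the first column with all others.
    correlations[int(month)] = corr.iloc[0]

print(correlations[1])
```

Each entry of correlations is the same Series that the question computes by hand for a single selectmonth.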

convert a dataframe column from string to List of numbers

I have created the following dataframe from a csv file:
id marks
5155 1,2,3,,,,,,,,
2156 8,12,34,10,4,3,2,5,0,9
3557 9,,,,,,,,,,
7886 0,7,56,4,34,3,22,4,,,
3689 2,8,,,,,,,,
It is indexed on id. The values for the marks column are string. I need to convert them to a list of numbers so that I can iterate over them and use them as index number for another dataframe. How can I convert them from string to a list? I tried to add a new column and convert them based on "Add a columns in DataFrame based on other column" but it failed:
df = df.assign(new_col_arr=lambda x: np.fromstring(x['marks'].values[0], sep=',').astype(int))
Here's a way to do it:
df = df.assign(new_col_arr=df['marks'].str.split(','))
# convert to int
df['new_col'] = df['new_col_arr'].apply(lambda x: list(map(int, [i for i in x if i != ''])))
I presume that you want to create a NEW dataframe, since the number of items is different from the number of rows. I suggest the following:
#source data
df = pd.DataFrame({'id':[5155, 2156, 7886],
                   'marks':['1,2,3,,,,,,,,','8,12,34,10,4,3,2,5,0,9', '0,7,56,4,34,3,22,4,,,']})
# create dictionary from df:
dd = {row['id']: np.fromstring(row['marks'], dtype=int, sep=',') for _, row in df.iterrows()}
{5155: array([1, 2, 3]),
2156: array([ 8, 12, 34, 10, 4, 3, 2, 5, 0, 9]),
7886: array([ 0, 7, 56, 4, 34, 3, 22, 4])}
# here you pad the lists inside dictionary so that they have equal length
...
# convert dd to DataFrame:
df2 = pd.DataFrame(dd)
I found two similar alternatives:
1.
df['marks'] = df['marks'].str.split(',').map(lambda num_str_list: [int(num_str) for num_str in num_str_list if num_str])
2.
df['marks'] = df['marks'].map(lambda arr_str: [int(num_str) for num_str in arr_str.split(',') if num_str])
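The question mentions using the parsed numbers as index positions into another dataframe. A small sketch of that last step follows, assuming the marks are positional (iloc) indices; the frame called other and the helper parse_marks are made up for illustration.

```python
import pandas as pd

# A few rows from the question, indexed on id.
df = pd.DataFrame(
    {"marks": ["1,2,3,,,,,,,,", "9,,,,,,,,,,", "0,7,56,4,34,3,22,4,,,"]},
    index=pd.Index([5155, 3557, 7886], name="id"),
)

def parse_marks(text):
    """Split on commas and keep only the non-empty tokens as integers."""
    return [int(token) for token in text.split(",") if token]

df["marks_list"] = df["marks"].map(parse_marks)

# Hypothetical second frame with enough rows for the largest mark (56).
other = pd.DataFrame({"value": range(100, 160)})

positions = df.loc[5155, "marks_list"]
selected = other.iloc[positions]
print(selected)
```

If the marks are index labels rather than positions, other.loc[positions] would be the analogous call.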

DataFrame assign inside function

I have a question regarding the DataFrame assign function. When using this function I must pass the column name without quotes. Why is this, and can I circumvent it? See the example below:
df = pd.DataFrame(columns=['Grade'])
df['Grade'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
df_temp = df.assign(Grade='Total')
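def dummy(df, _g):
    # if I write Grade here then I get the expected result
    return df.assign(_g='Total')
# here I want Grade to be assigned 'Total', but it creates a new column called _g
df_temp = dummy(df, 'Grade')
assign takes the column names as keyword arguments, so _g is read literally as the name. Unpack a dictionary to pass the name held in the variable:
def dummy(df, _g):
    return df.assign(**{_g: 'Total'})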
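To see the difference between the two calls side by side, here is a short sketch on a three-row frame; set_column is a hypothetical wrapper name.

```python
import pandas as pd

df = pd.DataFrame({"Grade": [1, 2, 3]})

def set_column(frame, column_name, value):
    """Assign `value` to the column whose name is stored in `column_name`."""
    # ** unpacks the dict, so the dict key becomes the keyword argument name.
    return frame.assign(**{column_name: value})

literal = df.assign(_g="Total")             # keyword is taken literally: adds '_g'
dynamic = set_column(df, "Grade", "Total")  # overwrites the existing 'Grade' column

print(list(literal.columns))  # ['Grade', '_g']
print(list(dynamic.columns))  # ['Grade']
```

The dictionary form also works for names that are not valid Python identifiers, such as names containing spaces.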

Transforming pandas Dataframe into dictionary via function taking column inputs

I have the following pandas Dataframe:
dict1 = {'file': ['filename2', 'filename2', 'filename3', 'filename4', 'filename4', 'filename3'], 'amount': [3, 4, 5, 1, 2, 1], 'front':[21889611, 36357723, 196312, 11, 42, 1992], 'back':[21973805, 36403870, 277500, 19, 120, 3210]}
df1 = pd.DataFrame(dict1)
print(df1)
file amount front back
0 filename2 3 21889611 21973805
1 filename2 4 36357723 36403870
2 filename3 5 196312 277500
3 filename4 1 11 19
4 filename4 2 42 120
5 filename3 1 1992 3210
My task is to take N random draws between front and back, whereby N is equal to the value in amount. Parse this into a dictionary.
To do this on an row-by-row basis is easy for me to understand:
e.g. row 1
import numpy as np
random_draws = np.random.choice(np.arange(21889611, 21973805+1), 3)
e.g. row 2
random_draws = np.random.choice(np.arange(36357723, 36403870+1), 4)
Normally with pandas, users could define this as a function and use something like
def func(front, back, amount):
    return np.random.choice(np.arange(front, back+1), amount)
df["new_column"].apply(func)
but the result of my function is an array of varying size.
My second problem is that I would like the output to be a dictionary, of the format
{file: [random_draw_results], file: [random_draw_results], file: [random_draw_results], ...}
For the above example df1, the function should output this dictionary (given the draws):
final_dict = {"filename2": [21927457, 21966814, 21898538, 36392840, 36375560, 36384078, 36366833],
              "filename3": [212143, 239725, 240959, 197359, 276948, 3199],
              "filename4": [100, 83, 15]}
We can pass axis=1 to apply so that it operates over rows.
The function then picks the columns it needs from each row and returns a list.
Finally, we combine the lists per file, either with some form of groupby or with a defaultdict as shown below:
dict1 = {'file': ['filename2', 'filename2', 'filename3', 'filename4', 'filename4', 'filename3'], 'amount': [3, 4, 5, 1, 2, 1], 'front':[21889611, 36357723, 196312, 11, 42, 1992], 'back':[21973805, 36403870, 277500, 19, 120, 3210]}
import numpy as np
import pandas as pd
def func(x):
    return np.random.choice(np.arange(x.front, x.back+1), x.amount).tolist()
df1 = pd.DataFrame(dict1)
df1["new_column"] = df1.apply(func, axis=1)
df1.groupby('file')['new_column'].apply(sum).to_dict()
Returns:
{'filename2': [21891765,
21904680,
21914414,
36398355,
36358161,
36387670,
36369443],
'filename3': [240766, 217580, 217581, 274396, 241413, 2488],
'filename4': [18, 96, 107]}
A second alternative is to use a defaultdict (in some small timings I ran, it looks about as fast):
from collections import defaultdict
d = defaultdict(list)
for k,v in df1.set_index('file')['new_column'].items():
    d[k].extend(v)
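Building np.arange(front, back+1) allocates the whole range just to sample from it. One possible variation, sketched here as an assumption rather than as the answer's method, draws the integers directly with NumPy's Generator.integers and collects them per file in a single pass.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "file": ["filename2", "filename2", "filename3", "filename4", "filename4", "filename3"],
    "amount": [3, 4, 5, 1, 2, 1],
    "front": [21889611, 36357723, 196312, 11, 42, 1992],
    "back": [21973805, 36403870, 277500, 19, 120, 3210],
})

rng = np.random.default_rng(seed=0)  # seed chosen only to make runs repeatable
draws = defaultdict(list)
for row in df1.itertuples(index=False):
    # endpoint=True makes `back` inclusive, like np.arange(front, back + 1).
    sample = rng.integers(row.front, row.back, size=row.amount, endpoint=True)
    draws[row.file].extend(sample.tolist())

final_dict = dict(draws)
print(final_dict)
```

Like np.random.choice with its default settings, this samples with replacement, so repeated values are possible.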
