Sometimes I end up with a series of tuples/lists when using Pandas. This is common when, for example, doing a group-by and passing a function that has multiple return values:
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(dict(x=np.random.randn(100),
                       y=np.repeat(list("abcd"), 25)))
out = df.groupby("y").x.apply(stats.ttest_1samp, 0)
print(out)
y
a (1.3066417476, 0.203717485506)
b (0.0801133382517, 0.936811414675)
c (1.55784329113, 0.132360504653)
d (0.267999459642, 0.790989680709)
dtype: object
What is the correct way to "unpack" this structure so that I get a DataFrame with two columns?
A related question is how I can unpack either this structure or the resulting dataframe into two Series/array objects. This almost works:
t, p = zip(*out)
but then t is
(array(1.3066417475999257),
array(0.08011333825171714),
array(1.557843291126335),
array(0.267999459641651))
and one needs to take the extra step of squeezing it.
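Concretely, the extra step looks something like this (a small sketch, assuming each element of out is a tuple of 0-d arrays):

t, p = zip(*out)
t = np.asarray(t).squeeze()  # back to a plain 1-D float array
p = np.asarray(p).squeeze()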
This is probably the most straightforward (most Pythonic, I guess) approach:
out.apply(pd.Series)
If you want to rename the columns to something more meaningful, then:
out.columns=['Kstats','Pvalue']
If you do not want the default name for the index:
out.index.name=None
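Putting those three steps together (a minimal sketch; note that the renaming applies to the unpacked DataFrame rather than to the original Series):

unpacked = out.apply(pd.Series)
unpacked.columns = ['Kstats', 'Pvalue']
unpacked.index.name = None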
maybe:
>>> pd.DataFrame(out.tolist(), columns=['out-1','out-2'], index=out.index)
out-1 out-2
y
a -1.9153853424536496 0.067433
b 1.277561889173181 0.213624
c 0.062021492729736116 0.951059
d 0.3036745009819999 0.763993
[4 rows x 2 columns]
I believe you want this:
df = pd.DataFrame(out.tolist())
df.columns = ['KS-stat', 'P-value']
result:
KS-stat P-value
0 -2.12978778869 0.043643
1 3.50655433879 0.001813
2 -1.2221274198 0.233527
3 -0.977154419818 0.338240
I ran into a similar problem. The two ways I found to solve it are exactly the answers of @CT Zhu and @Siraj S.
Here is some supplementary information you might find interesting:
I compared the two approaches and found that @CT Zhu's way performs much faster as the input size grows.
Example:
# Python 3
import time
from statistics import mean

import pandas as pd

df_a = pd.DataFrame({'a': range(1000), 'b': range(1000)})

# function to test
def func1(x):
    c = str(x) * 3
    d = int(x) + 100
    return c, d

# Siraj S's way
time_difference = []
for i in range(100):
    start = time.time()
    df_b = df_a['b'].apply(lambda x: func1(x)).apply(pd.Series)
    end = time.time()
    time_difference.append(end - start)
print(mean(time_difference))
# 0.14907703161239624

# CT Zhu's way
time_difference = []
for i in range(100):
    start = time.time()
    df_b = pd.DataFrame(df_a['b'].apply(lambda x: func1(x)).tolist())
    end = time.time()
    time_difference.append(end - start)
print(mean(time_difference))
# 0.0014058423042297363
PS: Please forgive my ugly code.
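For what it's worth, the standard library's timeit module gives a similar comparison with less boilerplate (a rough sketch, assuming df_a and func1 are defined as above):

import timeit

apply_series = timeit.timeit(
    lambda: df_a['b'].apply(func1).apply(pd.Series), number=100)
tolist_ctor = timeit.timeit(
    lambda: pd.DataFrame(df_a['b'].apply(func1).tolist()), number=100)
print(apply_series / 100, tolist_ctor / 100)  # mean seconds per run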
I'm not sure if t and r are predefined somewhere, but if not, I get the two tuples passed to t and r by:
>>> t, r = zip(*out)
>>> t
(-1.776982300308175, 0.10543682705459552, -1.7206831272759038, 1.0062163376448068)
>>> r
(0.08824925924534484, 0.9169054844258786, 0.09817788453771065, 0.3243492942246433)
Thus, you could do this:
>>> df = pd.DataFrame(columns=['t', 'r'])
>>> df.t, df.r = zip(*out)
>>> df
t r
0 -1.776982 0.088249
1 0.105437 0.916905
2 -1.720683 0.098178
3 1.006216 0.324349
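Equivalently, the same frame can be built in a single call (a small sketch):

df = pd.DataFrame({'t': t, 'r': r})  # column order follows the dict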
I was able to solve the problem described below, but as I am a newbie, I am not sure if my solution is good. I'd be grateful for any tips on how to do it in a more efficient and/or more elegant manner.
What I have:
...and so on (the table's quite big).
What I need:
How I solved it:
Load the file
df = pd.read_csv("survey_data_cleaned_ver2.csv")
Define a function
def transform_df(df, list_2, column_2, list_1, column_1='Respondent'):
    for ind in df.index:
        elements = df[column_2][ind].split(';')
        num_of_elements = len(elements)
        for num in range(num_of_elements):
            list_1.append(df['Respondent'][ind])
        for el in elements:
            list_2.append(el)
Dropna because NaNs are floats and that was causing errors later on.
df_LanguageWorkedWith = df[['Respondent', 'LanguageWorkedWith']]
df_LanguageWorkedWith.dropna(subset='LanguageWorkedWith', inplace=True)
Create empty lists
Respondent_As_List = []
LanguageWorkedWith_As_List = []
Call the function
transform_df(df_LanguageWorkedWith, LanguageWorkedWith_As_List, 'LanguageWorkedWith', Respondent_As_List)
Transform the lists into dataframes
df_Respondent = pd.DataFrame(Respondent_As_List, columns=["Respondent"])
df_LanguageWorked = pd.DataFrame(LanguageWorkedWith_As_List, columns=["LanguageWorkedWith"])
Concatenate those dataframes
df_LanguageWorkedWith_final = pd.concat([df_Respondent, df_LanguageWorked], axis=1)
And that's it.
The code and input file can be found on my GitHub: https://github.com/jarsonX/Temp_files
Thanks in advance!
You can try it like this. I haven't tested it, but it should work:
df['LanguageWorkedWith'] = df['LanguageWorkedWith'].str.replace(';', ',')
df = df.assign(LanguageWorkedWith=df['LanguageWorkedWith'].str.split(',')).explode('LanguageWorkedWith')
#Tested
LanguageWorkedWith Respondent
0 C 4
0 C++ 4
0 C# 4
0 Python 4
0 SQL 4
... ... ...
10319 Go 25142
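A slightly shorter variant of the same idea splits on ';' directly, so the replace step isn't needed (an untested sketch):

df_exploded = (
    df.dropna(subset=['LanguageWorkedWith'])
      .assign(LanguageWorkedWith=lambda d: d['LanguageWorkedWith'].str.split(';'))
      .explode('LanguageWorkedWith')
      [['Respondent', 'LanguageWorkedWith']]
)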
Sample dataframe:
id x y
145421 a b
356005 d a
478279 r f
451426 f p
566927 d k
I want to drop the entire rows where column id equals 356005, 478279, or 566927, keeping only the remaining rows.
My code:
df[~(df["id"].isin([356005,478279,566927]))]
Output I want:
id x y
145421 a b
451426 f p
But after running the command, Jupyter gets stuck and I had to restart it several times. Is there an efficient way to write this so that it runs instantly?
This might work for you:
indx = df[df["id"].apply(lambda x : x in [356005,478279,566927])]
df.drop(index=indx.index, inplace = True)
or
indx = df.query('id in [356005,478279,566927]')
df.drop(index=indx.index, inplace = True)
This is also a better approach when the list of ids to be dropped is long.
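That said, keeping the complement directly with a boolean mask, as in the question, avoids the intermediate drop entirely (a minimal sketch):

# equivalent to the two-step drop above, in one pass
df = df[~df['id'].isin([356005, 478279, 566927])].copy()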
I think you are looking for
df.drop(df.index[[356005,478279,566927]], inplace = True)
Please refer to pandas docs for more information.
I'm a beginner in Python and I need to translate some R code to Python.
I need to find one root per row in a dataset, based on a dynamic function. The R code is:
library(rootSolve)
library(dplyr)
library(plyr)
dataset = data.frame(A = c(10,20,30),B=c(20,10,40), FX = c("A+B-x","A-B+x","A*B-x"))
sol<- adply(dataset,1, summarize,
solution_0= uniroot.all(function(x)(eval(parse(text=as.character(FX),dataset))),lower = -10000, upper = 10000, tol = 0.00001))
This code returns [30, -10, 1200] as the solution for the three rows.
In Python I read the documentation of SciPy's optimize package, but I couldn't find anything that works for me.
I tried solutions like the one below, but without success:
import pandas as pd
from scipy.optimize import fsolve as fs

data = {'A': [10, 20, 30],
        'B': [20, 10, 40],
        'FX': ["A+B-x", "A-B+x", "A*B-x"]}
df = pd.DataFrame(data)

def func(FX):
    return(exec(FX))

fs(func(df.FX), x0=0, args=df)
Does anyone have an idea how to solve this?
Many thanks.
SymPy is a symbolic math library for Python. Your question can be solved as:
import pandas as pd
from sympy import Symbol, solve
from sympy.parsing.sympy_parser import parse_expr

data = {'A': [10, 20, 30],
        'B': [20, 10, 40],
        'FX': ["A+B-x", "A-B+x", "A*B-x"]}
df = pd.DataFrame(data)

x = Symbol("x", real=True)
for index, row in df.iterrows():
    F = parse_expr(row['FX'], local_dict={'A': row['A'], 'B': row['B'], 'x': x})
    print(row['A'], row['B'], row['FX'], "-->", F, "-->", solve(F, x))
This outputs:
10 20 A+B-x --> 30 - x --> [30]
20 10 A-B+x --> x + 10 --> [-10]
30 40 A*B-x --> 1200 - x --> [1200]
Note that SymPy's solve returns a list of solutions. If you are sure there is always exactly one solution, just use solve(F, x)[0]. (Remember that unlike R, Python always starts indexing with 0.)
With list comprehension, you could write the solution as:
sol = [solve(parse_expr(row['FX'], local_dict={'A': row['A'], 'B': row['B'], 'x': x}),
             x)[0] for _, row in df.iterrows()]
If you have many columns, you can also create the dictionary with a loop: dict({c: row[c] for c in df.columns}, **{'x': x}). The weird ** syntax is needed if you want to combine the dictionaries inside the list comprehension. See this post about the union of dictionaries.
cols = df.columns # change this if you won't need all columns
sol = [solve(parse_expr(row['FX'],
                        local_dict=dict({c: row[c] for c in cols}, **{'x': x})),
             x)[0].evalf() for _, row in df.iterrows()]
PS: SymPy normally keeps the solutions in a symbolic form because it prefers exact expressions. When there are e.g. fractions or square roots, they are not evaluated immediately. To get the evaluated form, use evalf() as in solve(F, x)[0].evalf().
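For example, a tiny sketch of the difference between the symbolic and the evaluated form:

from sympy import Symbol, solve

x = Symbol('x', real=True)
roots = solve(x**2 - 2, x)         # kept symbolic, e.g. [-sqrt(2), sqrt(2)]
print([r.evalf() for r in roots])  # numeric values instead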
I have a function which I'm trying to apply in parallel and within that function I call another function that I think would benefit from being executed in parallel. The goal is to take in multiple years of crop yields for each field and combine all of them into one pandas dataframe. I have the function I use for finding the closest point in each dataframe, but it is quite intensive and takes some time. I'm looking to speed it up.
I've tried creating a pool and using map_async on the inner function. I've also tried doing the same with the loop for the outer function. The latter is the only thing I've gotten to work the way I intended it to. I can use this, but I know there has to be a way to make it faster. Check out the code below:
from multiprocessing import Pool

import pandas as pd
from geopy import distance  # assumed: geopy's distance module providing great_circle

return_columns = []
return_columns_cb = lambda x: return_columns.append(x)

def getnearestpoint(gdA, gdB, retcol):
    dist = lambda point1, point2: distance.great_circle(point1, point2).feet

    def find_closest(point):
        distances = gdB.apply(
            lambda row: dist(point, (row["Longitude"], row["Latitude"])), axis=1
        )
        return (gdB.loc[distances.idxmin(), retcol], distances.min())

    append_retcol = gdA.apply(
        lambda row: find_closest((row["Longitude"], row["Latitude"])), axis=1
    )
    return append_retcol

def combine_yield(field):
    # field is a list of the files for the field I'm working with
    # lots of pre-processing
    # dfs in this case is a list of the dataframes for the current field
    # mdf is the dataframe with the most points, which I popped from this list
    p = Pool()
    for i in range(0, len(dfs)):
        p.apply_async(getnearestpoint, args=(mdf, dfs[i], dfs[i].columns[-1]),
                      callback=return_columns_cb)
    p.close()
    p.join()  # wait for the workers so the callbacks have populated return_columns
    for col in return_columns:
        mdf = mdf.append(col)
    '''I unzip my points back to longitude and latitude here in the final
    dataframe so I can write to csv without tuples'''
    mdf[["Longitude", "Latitude"]] = pd.DataFrame(
        mdf["Point"].tolist(), index=mdf.index
    )
    return mdf

def multiprocess_combine_yield():
    '''do stuff to get the dictionary below with each field name as key and
    values as all the files for that field'''
    yield_by_field = {'C01': ('files...'), ...}
    # The farm I'm working on has 30 fields and the loop below is too slow
    for k, v in yield_by_field.items():
        combine_yield(v)
I guess what I need help with is this: I envision something like using a pool to imap or apply_async on each tuple of files in the dictionary. Then, within the combine_yield function applied to that tuple of files, I want to be able to parallelize the distance function. That function bogs the program down because it calculates the distance between every point in each of the dataframes for each year of yield. The files average around 1200 data points, and you multiply all of that by 30 fields, so I need something better. Maybe the efficiency improvement lies in finding a better way to pull in the closest point. I still need something that gives me the value from gdB and the distance, though, because of what I do later on when selecting which rows to use from the 'mdf' dataframe.
Thanks to @ALollz's comment, I figured this out. I went back to my getnearestpoint function and, instead of doing a bunch of Series.apply calls, I am now using cKDTree from scipy.spatial to find the closest point, and then a vectorized haversine distance to calculate the true distances for each of these matched points. Much, much quicker. Here are the basics of the code:
import re
from itertools import zip_longest

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def getnearestpoint(gdA, gdB, retcol):
    gdA_coordinates = np.array(
        list(zip(gdA.loc[:, "Longitude"], gdA.loc[:, "Latitude"]))
    )
    gdB_coordinates = np.array(
        list(zip(gdB.loc[:, "Longitude"], gdB.loc[:, "Latitude"]))
    )
    tree = cKDTree(data=gdB_coordinates)
    distances, indices = tree.query(gdA_coordinates, k=1)
    # These column names are done as so due to formatting of my 'retcols'
    df = pd.DataFrame.from_dict(
        {
            f"Longitude_{retcol[:4]}": gdB.loc[indices, "Longitude"].values,
            f"Latitude_{retcol[:4]}": gdB.loc[indices, "Latitude"].values,
            retcol: gdB.loc[indices, retcol].values,
        }
    )
    gdA = pd.merge(left=gdA, right=df, left_on=gdA.index, right_on=df.index)
    gdA.drop(columns="key_0", inplace=True)
    return gdA

def combine_yield(field):
    # same preprocessing as before
    for i in range(0, len(dfs)):
        mdf = getnearestpoint(mdf, dfs[i], dfs[i].columns[-1])
    main_coords = np.array(list(zip(mdf.Longitude, mdf.Latitude)))
    lat_main = main_coords[:, 1]
    longitude_main = main_coords[:, 0]
    longitude_cols = [
        c for c in mdf.columns for m in [re.search(r"Longitude_B\d{4}", c)] if m
    ]
    latitude_cols = [
        c for c in mdf.columns for m in [re.search(r"Latitude_B\d{4}", c)] if m
    ]
    year_coords = list(zip_longest(longitude_cols, latitude_cols, fillvalue=np.nan))
    for i in year_coords:
        year = re.search(r"\d{4}", i[0]).group(0)
        year_coords = np.array(list(zip(mdf.loc[:, i[0]], mdf.loc[:, i[1]])))
        year_coords = np.deg2rad(year_coords)
        lat_year = year_coords[:, 1]
        longitude_year = year_coords[:, 0]
        diff_lat = lat_main - lat_year
        diff_lng = longitude_main - longitude_year
        d = (
            np.sin(diff_lat / 2) ** 2
            + np.cos(lat_main) * np.cos(lat_year) * np.sin(diff_lng / 2) ** 2
        )
        mdf[f"{year} Distance"] = 2 * (2.0902 * 10 ** 7) * np.arcsin(np.sqrt(d))
    return mdf
Then I'll just do Pool.map(combine_yield, (v for k,v in yield_by_field.items()))
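For completeness, a minimal sketch of that last step with a context-managed pool (assuming yield_by_field is built as before):

from multiprocessing import Pool

if __name__ == "__main__":
    with Pool() as pool:
        combined = pool.map(combine_yield, list(yield_by_field.values()))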
This has made a substantial difference. Hope it helps anyone else in a similar predicament.
After running some commands I have a pandas dataframe, eg.:
>>> print df
B A
1 2 1
2 3 2
3 4 3
4 5 4
I would like to print this out so that it produces simple code that would recreate it, eg.:
DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
I tried pulling out each of the three pieces (data, columns and rows):
[[e for e in row] for row in df.iterrows()]
[c for c in df.columns]
[r for r in df.index]
but the first line fails because e is not a value but a Series.
Is there a pre-build command to do this, and if not, how do I do it? Thanks.
You can get the values of the data frame in array format by calling df.values:
df = pd.DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
arrays = df.values
cols = df.columns
index = df.index
df2 = pd.DataFrame(arrays, columns = cols, index = index)
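Building on that, one way to emit the reconstruction string itself might look like this (a small sketch; the helper name is just illustrative, and it assumes a flat index and plain column labels):

def df_to_code(frame):
    # returns a string that recreates `frame` when evaluated with pandas imported as pd
    return "pd.DataFrame({data}, columns={cols}, index={idx})".format(
        data=frame.values.tolist(),
        cols=list(frame.columns),
        idx=list(frame.index),
    )

print(df_to_code(df2))
# pd.DataFrame([[2, 1], [3, 2], [4, 3], [5, 4]], columns=['B', 'A'], index=[1, 2, 3, 4])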
Based on @Woody Pride's approach, here is the full solution I am using. It handles hierarchical indices and index names.
from types import MethodType
from pandas import DataFrame, MultiIndex
def _gencmd(df, pandas_as='pd'):
    """
    With this addition to DataFrame's methods, you can use:
        df.command()
    to get the command required to regenerate the dataframe df.
    """
    if pandas_as:
        pandas_as += '.'
    index_cmd = df.index.__class__.__name__
    if type(df.index) == MultiIndex:
        index_cmd += '.from_tuples({0}, names={1})'.format([i for i in df.index], df.index.names)
    else:
        index_cmd += "({0}, name='{1}')".format([i for i in df.index], df.index.name)
    return 'DataFrame({0}, index={1}{2}, columns={3})'.format([[xx for xx in x] for x in df.values],
                                                              pandas_as,
                                                              index_cmd,
                                                              [c for c in df.columns])

DataFrame.command = MethodType(_gencmd, None, DataFrame)
I have only tested it on a few cases so far and would love a more general solution.
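Note that the MethodType(..., None, DataFrame) call at the end is Python 2 syntax; under Python 3 a plain attribute assignment is enough (a small sketch):

# Python 3: just attach the function; it becomes a bound method on instances
DataFrame.command = _gencmd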