How do I search for a tuple of values in pandas?

How do I search for a tuple of values in pandas? - python

I'm trying to write a function to swap a dictionary of targets with results in a pandas dataframe. I'd like to match a tuple of values and swap out new values. I tried building it as follows, but the the row select isn't working. I feel like I'm missing some critical function here.
import pandas
testData=pandas.DataFrame([["Cats","Parrots","Sandstone"],["Dogs","Cockatiels","Marble"]],columns=["Mammals","Birds","Rocks"])
target=("Mammals","Birds")
swapVals={("Cats","Parrots"):("Rats","Canaries")}
for x in swapVals:
#Attempt 1:
#testData.loc[x,target]=swapVals[x]
#Attempt 2:
testData[testData.loc[:,target]==x,target]=swapVals[x]

This was written in Python 2, but the basic idea should work for you. It uses the apply function:
import pandas
testData=pandas.DataFrame([["Cats","Parrots","Sandstone"],["Dogs","Cockatiels","Marble"]],columns=["Mammals","Birds","Rocks"])
swapVals={("Cats","Parrots"):("Rats","Canaries")}
target=["Mammals","Birds"]
def swapper(in_row):
temp =tuple(in_row.values)
if temp in swapVals:
return list(swapVals[temp])
else:
return in_row
testData[target] = testData[target].apply(swapper, axis=1)
testData
Note that if you loaded the other keys into the dict, you could do the apply without the swapper function:
import pandas
testData=pandas.DataFrame([["Cats","Parrots","Sandstone"],["Dogs","Cockatiels","Marble"]],columns=["Mammals","Birds","Rocks"])
swapVals={("Cats","Parrots"):("Rats","Canaries"), ("Dogs","Cockatiels"):("Dogs","Cockatiels")}
target=["Mammals","Birds"]
testData[target] = testData[target].apply(lambda x: list(swapVals[tuple(x.values)]), axis=1)
testData

Related

Python index value from list, return index value and list value

I'm trying to use a list of specific codes to index any time one of those codes is used and then return the value of that code and the parameter name associated with it.
import numpy as np
import pandas as pd
param_list = pd.read_csv(r'C:/Users/Gordo/Documents/GraduateSchool/Research/GroundWaterML/parameter_cd_query.csv')
#def p_list():
# return [param_list['p_cd'], param_list['param_nm']]
for item, value in param_list['p_cd'], param_list['parm_nm']:
if item in ['p00010','p00020','p00025','p00058','p00059','p00090','p00095','p00191','p00300','p00301','p00400','p00405','p00410',
'p00450','p00452','p00453','p00602','p00607','p00608','p00613','p00618','p00631','p00660','p00666','p00671',
'p00681','p00900','p00904','p00905','p00915','p00925','p00930','p00931','p00932','p00935','p00940',
'p00945','p00950','p00955','p01000','p01005','p01010','p01020','p01025','p01030','p01035','p01040','p01046',
'p01049','p01060','p01065','p01080','p01085','p01090','p01106','p01130','p01145','p01155','p04035','p07000',
'p09511','p22703','p29801','p39086','p49933','p50624','p61028','p62636','p62639','p62642','p62645',
'p63041','p63162','p63790','p70300','p70301','p70303','p71846','p71851','p71856','p71865','p71870','p72015',
'p72016','p72019','p82081','p82082','p82085','p90095','p99832','p99833','p99834']:
print (item, value)

If I understand your question correctly, you have your own predefined codes and you're trying to see if items from a csv file matches any of your codes. If that's the case, you can get all the matches just by filtering the dataframe (since you're using pandas anyway).
import pandas as pd
param_df = pd.read_csv(r'C:/Users/Gordo/Documents/GraduateSchool/Research/GroundWaterML/parameter_cd_query.csv')
my_codes = ['p00010','p00020','p00025','p00058','p00059','p00090','p00095','p00191','p00300','p00301','p00400','p00405','p00410',
'p00450','p00452','p00453','p00602','p00607','p00608','p00613','p00618','p00631','p00660','p00666','p00671',
'p00681','p00900','p00904','p00905','p00915','p00925','p00930','p00931','p00932','p00935','p00940',
'p00945','p00950','p00955','p01000','p01005','p01010','p01020','p01025','p01030','p01035','p01040','p01046',
'p01049','p01060','p01065','p01080','p01085','p01090','p01106','p01130','p01145','p01155','p04035','p07000',
'p09511','p22703','p29801','p39086','p49933','p50624','p61028','p62636','p62639','p62642','p62645',
'p63041','p63162','p63790','p70300','p70301','p70303','p71846','p71851','p71856','p71865','p71870','p72015',
'p72016','p72019','p82081','p82082','p82085','p90095','p99832','p99833','p99834']
result = param_df[param_df.p_cd.isin(my_codes)]
result gives you all matches. If you just want an array with the first match, you can do:
result.loc[0].values

Dataframe with arrays and key-pairs

I have a JSON structure which I need to convert it into data-frame. I have converted through pandas library but I am having issues in two columns where one is an array and the other one is key-pair value.
Pito Value
{"pito-key": "Number"} [{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]
How to break columns into the data-frames.

As far as I understood your question, you can apply regular expressions to do that.
import pandas as pd
import re
data = {'pito':['{"pito-key": "Number"}'], 'value':['[{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]']}
df = pd.DataFrame(data)
def get_value(s):
s = s[1]
v = re.findall(r'VALUE\":\".*\"', s)
return int(v[0][8:-1])
def get_pito(s):
s = s[0]
v = re.findall(r'key\": \".*\"', s)
return v[0][7:-1]
df['value'] = df.apply(get_value, axis=1)
df['pito'] = df.apply(get_pito, axis=1)
df.head()
Here I create 2 functions that transform your scary strings to values you want them to have
Let me know if that's not what you meant

Append model output to pd df rows

I'm trying to put Pyomo model output into pandas.DataFrame rows. I'm accomplishing it now by saving data as a .csv, then reading the .csv file as a DataFrame. I would like to skip the .csv step and put output directly into a DataFrame.
When I accomplish an optimization solution with Pyomo, the optimal assignments are 1 in the model.x[i] output data (0 otherwise). model.x[i] is indexed by dict keys in v. model.x is specific syntax to Pyomo
Pyomo assigns a timeItem[i], platItem[i], payItem[i], demItem[i], v[i] for each value that presents an optimal solution. The 0807results.csv file produces an accurate file of the optimal assignments showing the value of timeItem[i], platItem[i], payItem[i], demItem[i], v[i] for each valid assignment in the optimal solution.
When model.x[i] is 1, how can I get timeItem[i], platItem[i], payItem[i], demItem[i], v[i] directly into a DataFrame? Your assistance is greatly appreciated. My current code is below.
index=sorted(v.keys())
with open('0807results.csv', 'w') as f:
for i in index:
if value(model.x[i])>0:
f.write("%s,%s,%s,%s,%s\n"%(timeItem[i],platItem[i],payItem[i], demItem[i],v[i]))
from pandas import read_csv
now = datetime.datetime.now()
dtg=(now.strftime("%Y%m%d_%H%M"))
df = read_csv('0807results.csv')
df.columns = ['Time', 'Platform','Payload','DemandType','Value']
# convert payload types to string so not summed
df['Payload'] = df['Payload'].astype(str)
df = df.sort_values('Time')
df.to_csv('results'+(dtg)+'.csv')
# do stats & visualization with pandas df

I have no idea what is in the timeItem etc iterables from the code you've posted. However, I suspect that something similar to:
import pandas as pd
results = pd.DataFrame([timeItem, platItem, payItem, demItem, v], index=["time", "plat", "pay", "dem", "v"]).T
Will work.
If you want to filter on 1s in model.x, you might add it as a column as well, and do a filter with pandas directly:
import pandas as pd
results = pd.DataFrame([timeItem, platItem, payItem, demItem, v, model.x], index=["time", "plat", "pay", "dem", "v", "x"]).T
filtered_results = results[results["x"]>0]

You can also use the DataFrame.from_records() function:
def record_generator():
for i in sorted(v.keys()):
if value(model.x[i] > 1E-6): # integer tolerance
yield (timeItem[i], platItem[i], payItem[i], demItem[i], v[i])
df = pandas.DataFrame.from_records(
record_generator(), columns=['Time', 'Platform', 'Payload', 'DemandType', 'Value'])

pyspark RDD to DataFrame

I am new to Spark.
I have a DataFrame and I used the following command to group it by 'userid'
def test_groupby(df):
return list(df)
high_volumn = self.df.filter(self.df.outmoney >= 1000).rdd.groupBy(
lambda row: row.userid).mapValues(test_groupby)
It gives a RDD which in following structure:
(326033430, [Row(userid=326033430, poiid=u'114233866', _mt_datetime=u'2017-06-01 14:54:48', outmoney=1127.0, partner=2, paytype=u'157', locationcity=u'\u6f4d\u574a', locationprovince=u'\u5c71\u4e1c\u7701', location=None, dt=u'20170601')])
326033430 is the big group.
My question is how can I convert this RDD back to a DataFrame Structure? If I cannot do that, how I can get values from the Row term?
Thank you.

You should just
from pyspark.sql.functions import *
high_volumn = self.df\
.filter(self.df.outmoney >= 1000)\
.groupBy('userid').agg(collect_list('col'))
and in .agg method pass what You want to do with rest of data.
Follow this link : http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.agg

Exporting Pandas DataFrame with MultiIndex

I have just discovered pandas and am impressed by its capabilities.
I am having difficulties understanding how to work with DataFrame with MultiIndex.
I have two questions :
(1) Exporting the DataFrame
Here my problem:
This dataset
import pandas as pd
import StringIO
d1 = StringIO.StringIO(
"""Gender,Employed,Region,Degree
m,yes,east,ba
m,yes,north,ba
f,yes,south,ba
f,no,east,ba
f,no,east,bsc
m,no,north,bsc
m,yes,south,ma
f,yes,west,phd
m,no,west,phd
m,yes,west,phd """
)
df = pd.read_csv(d1)
# Frequencies tables
tab1 = pd.crosstab(df.Gender, df.Region)
tab2 = pd.crosstab(df.Gender, [df.Region, df.Degree])
tab3 = pd.crosstab([df.Gender, df.Employed], [df.Region, df.Degree])
# Now we export the datasets
tab1.to_excel('H:/test_tab1.xlsx') # OK
tab2.to_excel('H:/test_tab2.xlsx') # fails
tab3.to_excel('H:/test_tab3.xlsx') # fails
One work-around I could think of is to change the columns (The way R does)
def NewColums(DFwithMultiIndex):
NewCol = []
for item in DFwithMultiIndex.columns:
NewCol.append('-'.join(item))
return NewCol
# New Columns
tab2.columns = NewColums(tab2)
tab3.columns = NewColums(tab3)
# New export
tab2.to_excel('H:/test_tab2.xlsx') # OK
tab3.to_excel('H:/test_tab3.xlsx') # OK
My question is : Is there a more efficient way to do this in Pandas that I missed in the documentation ?
2) Selecting columns
This new structure does not allow to select colums on a given variable (the advantage of hierarchical indexing in first place). How can I select columns containing a given string (e.g. '-ba') ?
P.S: I have seen this question which is related but have not understood the reply proposed

This looks like a bug in to_excel, for the moment as a workaround I would recommend using to_csv (which seems not to show this issue).
I added this as an issue on github.
To answer the second question, if you really need to use to_excel...
You can use filter to select only those columns which include '-ba':
In [21]: filter(lambda x: '-ba' in x, tab2.columns)
Out[21]: ['east-ba', 'north-ba', 'south-ba']
In [22]: tab2[filter(lambda x: '-ba' in x, tab2.columns)]
Out[22]:
east-ba north-ba south-ba
Gender
f 1 0 1
m 1 1 0

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How do I search for a tuple of values in pandas? - python

Related

Python index value from list, return index value and list value

Dataframe with arrays and key-pairs

Append model output to pd df rows

pyspark RDD to DataFrame

Exporting Pandas DataFrame with MultiIndex

Categories

Resources