I'm using pandas, and have two data frames:
df1:
id  date  status  rpbid  rpfid
1   d1    closed  null   10
2   d2    closed  null   11
3   d3    closed  null   null
and df2:
id  date  status   rpbid  rpfid
10  d10   updated  1      null
11  d11   updated  2      9
9   d9    updated  11     null
The idea is that I would like to handle 2 cases:
1. where the closed record was the first and final record for that instance (id 3 in df1),
2. where the closed record has one or more updated records linked in df2.
rpbid and rpfid are short for replacedbyid and replacementforid.
So the resulting df would be:
id  date  status  rpbid  rpfid  id2   date2  rpbid2  rpfid2
1   d1    closed  null   10     10    d10    1       null
2   d2    closed  null   11     9     d9     11      null
3   d3    closed  null   null   null  null   null    null
So far, I've tried doing a first left join on df1 and df2 to get all of the first recursive joins. I then tried using a loop to check whether rpbid2 was null; if it wasn't, I went back and looked for that rpbid2 value in the id column of df2, and I would then like to update the second half of the merged data frame to be the next step in the join where applicable.
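Conceptually, the chain-following step I'm describing looks something like this sketch, written against the simplified column names in the example above (assuming the null entries are genuine NaN values; my actual code, which errors, is below):
import pandas as pd

def last_update(rpfid, updates):
    # follow the rpfid chain through the updates frame until it ends,
    # returning the final "updated" record for one closed row
    last = None
    while pd.notna(rpfid):
        last = updates.loc[updates['id'] == rpfid].iloc[0]   # assumes every referenced id exists
        rpfid = last['rpfid']
    return last

rows = [
    last_update(rpfid, df2) if pd.notna(rpfid)
    else pd.Series(index=df2.columns, dtype=object)          # case 1: no updates at all (id 3)
    for rpfid in df1['rpfid']
]
finals = pd.DataFrame(rows).add_suffix('2')
finals.index = df1.index
result = pd.concat([df1, finals], axis=1)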
Here is the original code; I haven't been able to get it to run without errors:
import pandas as pd

df = pd.read_csv(filename)
# split the file into the closed records and the updated records
df_initial = df.loc[df['LetterStatus'] == 'CLOSED']
dfx = df.loc[df['LetterStatus'] == 'UPDATED']
df_merged = pd.merge(df_initial, dfx, how='left',
                     left_on='ReferenceNumber', right_on='ReplacedByRefNumber')
df_copy = df_merged
for row in range(len(df_merged)):
    if len(str(df_merged.iloc[row]['ReplacedByRefNumber_y'])) > 1:
        # look the joined record's reference up in the updated records
        row_slice = dfx.loc[dfx['ReferenceNumber'] == df_merged.iloc[row]['ReplacedByRefNumber_y']]
        if row_slice.size == 0:
            df_merged.iloc[row]['ReplacedByRefNumber_y'] = 'Unknown'
            df_copy.iloc[row]['ReplacedByRefNumber_y'] = 'Unknown'
        else:
            df_copy.iloc[row][24:0] = row_slice
print(df_copy)
For more context: if the replacedbyid is null and the status is 'updated', that means it was the first record for that given order.
Disclaimer: rather than digging to the bottom of the question for a way to obtain the exact data frame you want, my intention with this answer is to make sure you are absolutely certain of what you are doing, and maybe to help you structure your data a bit better.
From what I can see, you have two data frames with reciprocal references used to update each other's content; the problem is that your example contains a row that is supposed to update an element in the other data frame while also being due for an update itself.
This makes a mess of the data structure and of row identification. It is not clear whether the id references in each data frame point to a row in the same data frame or in the other one, and the IDs are not sorted by insertion order. When you build your joined data frame, you recursively include yet another column of data still to be replaced, so the frame grows horizontally with no real purpose.
I think you have tried to invent your own way of updating data and are now running into problems you would not have had with a more common data structure, one already designed to be scalable and easy to manipulate.
If you want to update data, the best approach is a data model with a properly formatted table inside a database (you can still use pandas; data frames can be your tables). An update request then changes the to-be-updated content right where it lives, instead of being kept in a separate record of updates that itself carries further update requests; that gets very messy. If you want to keep a history of updates, have one table that is constantly being updated, plus a second table recording each manipulation already applied to the first; in the latter you can store the previous value and the value it was replaced with.
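As a minimal sketch of that layout (the column names here are purely illustrative, not taken from your data):
import pandas as pd
from datetime import datetime

# the "live" table: one row per order, always holding the current state
orders = pd.DataFrame(
    {'order_id': [1, 2, 3],
     'status': ['closed', 'closed', 'closed'],
     'updated_at': [datetime(2020, 1, 1)] * 3}
).set_index('order_id')

# the audit table: one row per change already applied to the live table
update_log = []   # list of dicts; turn it into a DataFrame whenever you want to inspect it

def apply_update(order_id, field, new_value):
    # change the live table in place and remember what was changed
    old_value = orders.at[order_id, field]
    orders.at[order_id, field] = new_value
    update_log.append({'order_id': order_id, 'changed_at': datetime.now(),
                       'field': field, 'old_value': old_value, 'new_value': new_value})

apply_update(2, 'status', 'updated')
log_df = pd.DataFrame(update_log)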
You should name your data frames properly, and when a field references an ID in another data frame, that field's name should clearly indicate which data frame it points to.
You can store dates directly in a field; you don't have to reference another table for them, which does not look good. Just use the datetime module and put the datetime objects into the data frame; Python takes care of the rest.
Variable names too should be somewhat self-explanatory: instead of using rpbid, and having to explain to everyone that means replacedbyid, just use replaced_by_id (note the underscores to separate words).
Related
I'm going on two months in Python and I am focusing hard on pandas right now. In my current position I use VBA on this data, so I'm learning pandas to slowly replace it and further my career.
As of now I believe my real problem is a lack of understanding of a key concept (or several). Any help would be greatly appreciated.
That said here is my problem:
Where could I go to learn more about doing this kind of precise filtering? I'm very close, but there is one key aspect I'm missing.
Goal(s)
Main goal: I need to skip certain values in my ID column.
The code below takes out the dashes ("-") and keeps only the first 9 digits. However, I need to skip certain IDs because they are unique.
After that I'll start to work on comparing multiple sheets.
The main data frame's IDs are formatted as 000-000-000-000.
The other data frames that I will compare it to have the IDs with no dashes, as 000000000, one group of three digits shorter, totaling nine digits.
The unique IDs that I need skipped are the same in both data frames, but are formatted completely differently, e.g. 000-000-000_#12, 000-000-000_35, or 000-000-000_z.
My code that I will use on each ID except the unique ones:
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
but I want to use an if statement, something like this (this does not work):
lst = ["000-000-000_#69B", "000-000-000_a", "etc.. random IDs"]
if ~dfSS["ID"].isin(lst).any():
    dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
else:
    pass
For more clarification my input DataFrame is this:
   ID                   Street #  Street Name
0  004-330-002-000      2272      Narnia
1  021-521-410-000_128  2311      Narnia
2  001-243-313-000      2235      Narnia
3  002-730-032-000      2149      Narnia
4  000-000-000_a        1234      Narnia
And I am looking to do this as the output:
   ID                   Street #  Street Name
0  004330002            2272      Narnia
1  021-521-410-000_128  2311      Narnia
2  001243313000         2235      Narnia
3  002730032000         2149      Narnia
4  000-000-000_a        1234      Narnia
Notes:
dfSS is my DataFrame variable name, i.e. the Excel sheet I am using. "ID" is my column heading; I will make it an index after the fact.
My data frame for this job is small, with (rows, columns) of (2500, 125).
I do not get an error message, so I am guessing maybe I need a loop of some kind. I've started testing for loops with this as well; no luck there... yet.
Here is where I have been to research this:
Comparison of a Dataframe column values with a list
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
if statement with ~isin() in pandas
recordlinkage module-I didn't think this was going to work
Regular expression operations - Having a hard time fully understanding this at the moment
There are a number of ways to do this. The first way here doesn't involve writing a function.
# Create a placeholder column with all transformed IDs
dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"] # conditional indexing
The second way is to write a function that conditionally converts the IDs, and it's not as fast as the first method.
def transform_ID(ID_val):
    if ID_val not in lst:
        return ID_val.replace("-", "")[:9]
    return ID_val  # leave the unique IDs untouched
dfSS['ID_trans'] = dfSS['ID'].apply(transform_ID)
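As a quick sanity check, here is the first method run on the small frame from the question (street names trimmed; lst here holds just the two unique IDs from the sample):
import pandas as pd

dfSS = pd.DataFrame({
    "ID": ["004-330-002-000", "021-521-410-000_128", "001-243-313-000",
           "002-730-032-000", "000-000-000_a"],
    "Street #": [2272, 2311, 2235, 2149, 1234],
})
lst = ["021-521-410-000_128", "000-000-000_a"]   # the IDs to leave untouched

dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"]
print(dfSS["ID"].tolist())
# ['004330002', '021-521-410-000_128', '001243313', '002730032', '000-000-000_a']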
This is based on @xyzxyzjayne's answer, but I have two issues I cannot figure out.
First issue
is that I get this warning (see the Edit at the bottom):
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Documentation for this warning
You'll see in the code below that I tried to put in .loc, but I can't seem to figure out how to eliminate this warning by using .loc correctly. Still learning it. No, I will not just ignore it even though the code works; this is a learning opportunity, I say.
Second issue
is that I do not understand this part of the code. I know the left side is supposed to be rows and the right side columns. That said, why does this work? ID is a column, not a row, when this code is run and I set the ID:
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
The area I don't understand yet is the left side of the comma (,) in this part:
df.loc[~df["ID "].isin(uniqueID), "ID "]
That said, here is the final result. Basically, as I said, it's XYZ's help that got me here, but I'm adding more .locs and playing with the documentation until I can eliminate the warning.
import time

# whole list of IDs I had to enter manually (1000+ entries); these IDs get skipped
uniqueID = ["032-234-987_#4256"]
# gets the columns I need to make the DataFrame smaller
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
         'Number of Vehicles Removed', 'County']]
# Place Holder will make our new column with this filter
df.loc[:, "Place Holder"] = df.loc[:, "ID "].str.replace("-", "").str[:9]
# the next line is the filter that goes through the list and skips those IDs; work in progress to fully understand
df.loc[~df["ID "].isin(uniqueID), "ID "] = df.loc[~df["ID "].isin(uniqueID), "Place Holder"]
# makes the ID our index
df = df.set_index("ID ")
# just here to add the date to our file name; time must be imported for this to work
todaysDate = time.strftime("%m-%d-%y")
# make it an excel file
df.to_excel("ID TEXT " + todaysDate + ".xlsx")
I will edit this once I get rid of the warning and figure out the left side, so I can explain it for everyone who needs/sees this post.
Edit: SettingWithCopyWarning:
I fixed this chained-indexing problem by making a copy of the original data frame before filtering and by making everything use .loc, as XYZ helped me with. Before you start to filter, use DataFrame.copy(), where DataFrame is the name of your own dataframe.
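In code, the fix looks roughly like this (original_df stands in for whatever frame you load from the file; the rest is the same .loc pattern as above):
# work on an explicit copy so the later .loc assignments don't touch a view of the original
df = original_df[['ID ', 'Street #', 'Street Name']].copy()
df.loc[:, "Place Holder"] = df.loc[:, "ID "].str.replace("-", "").str[:9]
df.loc[~df["ID "].isin(uniqueID), "ID "] = df.loc[~df["ID "].isin(uniqueID), "Place Holder"]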
It's my first time using Jupyter Notebook to analyze survey data (a .sav file), and I would like to read it in a way that shows the metadata, so I can connect the answers with the questions. I'm totally a newbie in this field, so any help is appreciated!
import pandas as pd
import pyreadstat
df, meta = pyreadstat.read_sav('./SimData/survey_1.sav')
type(df)
type(meta)
df.head()
Please let me know if there is an additional step needed for me to be able to see the metadata!
The meta object contains the metadata you are looking for. Probably the most useful attributes to look at are:
meta.column_names_to_labels: a dictionary mapping the column names as you have them in your pandas dataframe to labels, i.e. longer explanations of what each column means.
print(meta.column_names_to_labels)
meta.variable_value_labels: a dict whose keys are column names and whose values are themselves dicts, mapping the values you find in your dataframe to value labels.
print(meta.variable_value_labels)
For instance, if you have a column "gender" with values 1 and 2, you could get:
{"gender": {1:"male", 2:"female"}}
which means value 1 is male and 2 female.
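For example, you could translate a single column yourself with that dict (a sketch, assuming your data really has a "gender" column):
labels = meta.variable_value_labels.get("gender", {})
df["gender_label"] = df["gender"].map(labels)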
You can get those labels from the beginning if you pass the argument apply_value_formats:
df, meta = pyreadstat.read_sav('survey.sav', apply_value_formats=True)
You can also apply those value formats to your dataframe anytime with pyreadstat.set_value_labels which returns a copy of your dataframe with labels:
df_copy = pyreadstat.set_value_labels(df, meta)
meta.missing_ranges: gives you the ranges of user-defined missing values. Let's say that for a certain variable in the survey they encoded 1 meaning yes, 2 meaning no, and then missing values: 5 meaning "didn't answer" and 6 meaning "person not at home". When you read the dataframe, by default you will get the values 1 and 2 and NaN (missing) instead of 5 and 6. You can pass the argument user_missing to get 5 and 6 back, and meta.missing_ranges will tell you that 5 and 6 are missing values. variable_value_labels will then give you the "didn't answer" and "person not at home" labels.
df, meta = pyreadstat.read_sav("survey.sav", user_missing=True)
print(meta.missing_ranges)
print(meta.variable_value_labels)
These are the pieces of information potentially useful for your case; not all of them will necessarily be present in your dataset.
More information here: https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html
I have a dataset with some customer information, with one column containing device codes (identifying the device used). I need to translate these codes into actual model names.
I also have a second table with a column holding device codes (same as the first table) and another column holding the corresponding model names.
I know it may seem trivial; I have managed to translate codes into models by using a for loop, the .loc method and conditional substitution, but I'm looking for a more structured solution.
Here's an extract of the data.
df = pd.DataFrame(
    {
        'Device_code': ['SM-A520F', 'SM-A520F', 'iPhone9,3', 'LG-H860', 'WAS-LX1A', 'WAS-LX1A']
    }
)
transcription_table = pd.DataFrame(
    {
        'Device_code': ['SM-A520F', 'SM-A520X', 'iPhone9,3', 'LG-H860', 'WAS-LX1A', 'XT1662', 'iPhone11,2'],
        'models': ['Galaxy A5(2017)', 'Galaxy A5(2017)', 'iPhone 7', 'LG G5', 'P10 lite', 'Motorola Moto M', 'iPhone XS']
    }
)
Basically, I need to obtain the explicit model of the device every time there's a match between the Device_code columns of the two tables, and overwrite the Device_code of the first table (df) with the actual model name (or it can be written into a newly created column on the same row; this is less of a problem).
Thank you for your help.
Turn your transcription_table into an actual mapping (aka a dictionary) and then use Series.map:
transcription_dict = dict(transcription_table.values)
df['models'] = df['Device_code'].map(transcription_dict)
print(df)
output:
Device_code models
0 SM-A520F Galaxy A5(2017)
1 SM-A520F Galaxy A5(2017)
2 iPhone9,3 iPhone 7
3 LG-H860 LG G5
4 WAS-LX1A P10 lite
5 WAS-LX1A P10 lite
This is just one solution:
# Dictionary that maps device codes to models
mapping = transcription_table.set_index('Device_code').to_dict()['models']
# Apply mapping to a new column in the dataframe
# If no match is found, None will be filled in
df['Model'] = df['Device_code'].apply(lambda x: mapping.get(x))
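If you would rather overwrite the original column and leave unmatched codes untouched, a small variation on the same mapping is:
df['Device_code'] = df['Device_code'].map(mapping).fillna(df['Device_code'])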
I am building a shallow array in pandas containing pairs of values (concept - document):
doc1 doc2
concept1 1 0
concept2 0 1
concept3 1 0
I parse an XML file and get pairs (concept - doc); every time a new pair comes in I add it to the DataFrame.
Since the incoming pairs may or may not contain values already present in the rows and/or columns (a new concept and/or a new document), I use the following code:
onp=np.arange(1,21,1).reshape(4,5)
oindex=['concept1','concept2','concept3','concept4',]
ohead=['doc1','doc2','doc3','doc5','doc6']
data=onp
mydf=pd.DataFrame(data,index=oindex, columns=ohead)
#... loop ...
mydf.loc['conceptXX','ep8']=1
It works well, except that the value stored in the data frame is 1.0 rather than 1 (or a boolean), and when a new row and/or column is added the rest of its values are NaN. How can I avoid that? All the values added should be 0 or 1. (Note: the intention is to also have some columns for calculations, so I cannot just convert the whole dataframe to a single type with, for instance, mydf = mydf.astype(object).)
Thanks.
SECOND EDIT AFTER ALollz COMMENT
More explanation of the real problem.
I have an XML file that gives me the data in the following way:
<names>
<name>michael</name>
<documents>
<document>doc1</document>
<document>doc2</document>
</documents>
</name>
<name>mathieu</name>
<documents>
<document>doc1</document>
<document>docN</document>
</documents>
</name>
</names>
...
I want to pass this data to a dataframe to make calculations. Basically there are names that appear in different documents when parsing the XML with:
tree = ET.parse(myinputFile)
root = tree.getroot()
I am adding new values into the dataframe one by one.
Sometimes when adding, the name is already present in the dataframe but a new doc has to be added, and vice versa.
I hope that clarifies things a bit.
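Roughly, the loop that feeds these pairs into the frame looks like this (a simplified sketch; the trimmed XML above isn't quite well-formed, so I'm assuming each <name> element wraps both the name text and its <documents> block, and the exact lookups depend on the real layout):
import xml.etree.ElementTree as ET

tree = ET.parse(myinputFile)
root = tree.getroot()

for name_el in root.iter('name'):
    person = (name_el.text or '').strip()        # the name text right after <name>
    for doc_el in name_el.iter('document'):
        mydf.loc[person, doc_el.text] = 1        # same enlargement step as in the code above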
I was about to write this as a solution:
mydf.fillna(0, inplace=True)
mydf=mydf.astype(int)
This changes all the NaN values to 0 and then converts them to int to avoid floats.
It has a downside, though: I might want to have some columns with text data, and in that case an error occurs.
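One way around that is to convert only the document columns and leave any text columns alone (a sketch, assuming the doc columns share a recognisable prefix):
doc_cols = [c for c in mydf.columns if c.startswith('doc')]   # assumption: doc columns all start with 'doc'
mydf[doc_cols] = mydf[doc_cols].fillna(0).astype(int)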
I am beginning to move from R to Python and have a stupid question.
I have been looking for close to 5 hours to find a solution to my question.
I have the following code in R, which essentially takes the dataframe df and aggregates the out dates from a hospital based on unique IDs. My original table has many UIDs repeated, since someone may visit a hospital many times, and each time they leave the hospital they have an out date. I want the UID and all the out dates in one row. I could do this very easily with the following code in R:
newdf= aggregate(data = df, OutDate~UID, FUN=paste, sep="," )
Can anyone pray tell me how this can be accomplished in Python?
Here's what my table looks like after using the above function in R:
-UID1, 10/20/2008, 11/30/2008, 1/1/1900, 1/1/1900
-UID2, 6/19/2010, 1/1/1900
-UID3, 11/17/2009
-UID4, 3/14/2010 , 4/20/2010, 1/1/1900, 1/1/1900
-UID5, 12/12/2008, 8/27/2009, 1/1/1900
Ignore the dates, I just made them up, but the output needs to look like the above.
Previously I had multiple UID1 rows, one for each of the dates now in the columns.
How do I do this in Python?
You can do this with a defaultdict:
from collections import defaultdict

d = defaultdict(list)
for f in df.values:
    # assuming the first value in each row is the UID:
    d[f[0]].append(f)
Now d is a dictionary, where each key is the UID and the values are a list of rows from the dataframe. You can combine them into a string (like what you are doing with paste), like this:
for uid, values in d.items():
    for value in values:
        print('{},{}'.format(uid, ','.join(map(str, value))))
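If you would rather stay inside pandas, the same aggregation can also be written with groupby (a sketch, assuming the OutDate column already holds strings):
newdf = df.groupby('UID')['OutDate'].agg(','.join).reset_index()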
This sounds like building a dictionary where the key is the UID and you append each out date to that key as you loop through the data. This assumes that you are getting the data in the form of a csv file where each row of data is read by csv.DictReader; I make that assumption based on what you seem to show of the data file and the separators. As a result, each entry in the row (which can include in time, out time, diagnosis, etc.) is keyed by the header row. I will also assume that you can work out how to read the data in for csv processing. The quick code below shows how to generate the dictionary entries from each row once you have read it in.
I show the final form the data will take, followed by how it was derived.
data = {'UID1': ['out1', 'out2', 'out3'], 'UID2': ['out3', 'out4']}
data = {}
for d in datarow:                # datarow: the rows read from the csv file
    uid = d['UID']               # 'UID' and 'OUT' stand for whatever the header row calls these columns
    if uid not in data:
        data[uid] = []
    out = d['OUT']
    data[uid].append(out)
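Put together with the csv reading this answer assumes, that might look like the sketch below ('hospital.csv', 'UID' and 'OUT' are placeholders for your real file name and header names):
import csv

data = {}
with open('hospital.csv', newline='') as fh:      # placeholder file name
    for row in csv.DictReader(fh):
        uid = row['UID']                          # placeholder header names
        if uid not in data:
            data[uid] = []
        data[uid].append(row['OUT'])

# data now maps each UID to the list of its out dates, e.g.
# {'UID1': ['10/20/2008', '11/30/2008'], 'UID2': ['6/19/2010']}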