I'm currently working with a database table that has an ID as primary key which can have up to 28 digits.
For my use case I need to manipulate some data points in this table (including the ID) and write it back to the db table.
Now, for the ID I need to increment it by one, and I'm struggling to achieve this with pandas on Windows.
Unfortunately and obviously, I cannot read and save the ID as plain integers in the dataframe.
Converting it to np.float64 beforehand seems to be completely messing up the values.
For example:
I'm manipulating the data point with ID 2021051800100770010113340000
If I convert the ID column to np.float64 by explicitly providing the dtype of this column,
the ID becomes 2021051800100769903675441152.0, which seems to me to be a completely different number.
Also, I don't know whether incrementing the ID column by 1 is working, since the result will be the same as the number above.
Is there a proper way to do this? My last resort would be to convert the ID to a string and then change the last substring of that string, but I don't feel that would be a good and clean solution, not to mention that I'm not sure whether I could write it back to the db in that form.
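For reference, a minimal check of the float64 precision limit, using the values from my example above (I assume this is the source of the corruption):

import numpy as np

original = 2021051800100770010113340000    # the 28-digit ID as a Python int
as_float = np.float64(original)            # what the column holds once it is read as float64

print(int(as_float))                 # 2021051800100769903675441152 -- the corrupted value from above
print(original == int(as_float))     # False: float64 only carries ~15-16 significant digits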
edit//
Based on this suggestion (https://stackoverflow.com/a/21591439/3856569)
I edited the ID column the following way:
df["ID"] = df["ID"].apply(int)
and then incrementing the number.
I get the following result:
2021051800100769903675441152
2021051800100769903675441153
So the increment seems to work now, but I still see completely different numbers compared to what I was getting originally.
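A minimal sketch of keeping the column out of float64 entirely (the read_csv call below is only a placeholder for the actual loading step, since the data really comes from a db table) would be to load the ID as text and increment it with Python's arbitrary-precision int:

import pandas as pd

# Load the ID column as text so pandas never routes it through float64
df = pd.read_csv("data.csv", dtype={"ID": str})

# Python ints have arbitrary precision, so all 28 digits survive the increment
df["ID"] = df["ID"].apply(int) + 1

# Convert back to string before writing to the db if the driver cannot
# handle integers of this size
df["ID"] = df["ID"].astype(str)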
Please bear with me and look at this problem from another angle. If we can understand how the ID is formed, we may be able to handle it differently. For example, the first 8 digits look like a date, and if that is true, then your manipulation shouldn't modify those 8 digits unless your intention is to change the date. In that case, you can separate your ID (as a str) into 2 parts.
20210518 / 00100770010113340000
Now we only need to handle the second part, which can still be too large for np.int64. However, if you find out how it is formed, then perhaps you can further separate it and finally handle numbers that fit in np.int64.
For example, would the ID be formed in this way?
20210518 / 001 / 007 / 7001011334 / 0000
If we can split it into meaningful segments, then we know which parts to keep fixed when manipulating it (adding 1, in your case).
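A minimal sketch of the split-and-increment idea (the 8-digit date prefix is an assumption based on the example ID above):

def increment_id(id_str):
    # Keep the assumed date prefix fixed and increment only the trailing counter
    date_part, counter_part = id_str[:8], id_str[8:]
    width = len(counter_part)
    counter = int(counter_part) + 1               # Python ints handle the 20 digits fine
    return date_part + str(counter).zfill(width)  # preserve leading zeros

print(increment_id("2021051800100770010113340000"))
# 2021051800100770010113340001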
Related
I have 2 data sets that I need to join via the asset ID. I have columns in each table that can be used for the join, but the Excel docs are from different systems and therefore the formats are different.
My problem is that dataset 1 removes the zeros at the front of the ID (see table below). Both values relate to the same asset; it's just that the output from one system removes the leading zeros from the ID.
Asset    dataset 1    dataset 2
A        012345       12345
B        001235       1235
C        0891011      891011
I am currently creating a new column with lstrip() that removes the leading "0"s from the ID so the datasets match, and then doing the join.
I am wondering if there is a more efficient way of doing this within the join function.
Edit/Answer: It appears the way I have been doing it is the most efficient; thanks for your help.
As I understand it, you want to remove leading zeros. See this discussion.
Example
data = "01234564500"
print(data.lstrip("0"))
print(data.rstrip("0"))
Output
1234564500
012345645
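For the pandas side, one option (a sketch; the column names are assumptions) is to normalise the key with the vectorised str.lstrip during the merge, which is essentially what you are already doing:

import pandas as pd

df1 = pd.DataFrame({"asset_id": ["012345", "001235", "0891011"], "value": [1, 2, 3]})
df2 = pd.DataFrame({"asset_id": ["12345", "1235", "891011"], "other": ["a", "b", "c"]})

# str.lstrip("0") is the vectorised equivalent of calling lstrip() on each ID
merged = df1.assign(join_key=df1["asset_id"].str.lstrip("0")).merge(
    df2.rename(columns={"asset_id": "join_key"}), on="join_key"
)
print(merged)

An alternative, as long as every ID is purely numeric, is to cast both key columns to int with astype(int), which drops the leading zeros implicitly.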
I have a very large dataframe where only the first two columns are not bools. However, everything is brought in as a string due to the source. The True/False fields do contain actual blanks (not nan) as well, and the values are spelled out as 'True' and 'False'.
I'm trying to come up with a dynamic-ish way to do this without typing out or listing every column.
ndf.iloc[:,2:].astype(bool)
That seems to at least run and change it to bool, but when I add 'inplace=True' it has no effect on the stored dtypes. I've also tried the code below with no luck. It runs but doesn't actually do anything that I can tell.
ndf.iloc[:,2:] = ndf.iloc[:,2:].astype(bool)
I need to be able to write this table back into a database as 0s and 1s ultimately. I'm not the most versed at bools and am hoping there is an easy one liner way to do this that I don't know yet.
Actually
ndf.iloc[:,2:] = ndf.iloc[:,2:].astype(bool)
should work and change your data from str/object to bool. It's just that you get the same printout for 'True' and True. Check with ndf.dtypes to see the changes after that command.
If you want the booleans as 0 and 1, try:
ndf.iloc[:,2:] = ndf.iloc[:,2:].astype(bool).astype(int)
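One caveat, assuming the fields really hold the literal words 'True'/'False' plus blanks: astype(bool) treats any non-empty string as True, so the string 'False' would also end up as True. An explicit mapping may be safer in that case (a sketch using the same column slice as above):

# Map the literal strings to 0/1 directly; blanks ('') are taken as 0 here (an assumption)
ndf.iloc[:,2:] = ndf.iloc[:,2:].replace({'True': 1, 'False': 0, '': 0})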
I'm new to Python and just trying to figure out how this small bit of code works. Hoping this'll be easy to explain without an example data frame.
My data frame, called df_train, contains a column called Age. This column is NaN for 177 records.
I submit the following code...
df_train[df_train['Age'].isnull()]
... and it returns all records that are missing.
Now if I submit df_train['Age'].isnull(), all I get is a Boolean List of values. How does the data frame object then work to convert this Boolean List to the rows we actually want?
I don't understand how passing the boolean list to the data frame again results in just the 177 records that we need - could someone please ELI5 for a newbie?
You will have to create subsets of the dataframe you want to use. Suppose you want to use only those rows where df_train['Age'] is not null. In that case, you have to select
df_train_to_use = df_train[df_train['Age'].notnull()]
Now, you may cross-check whether any other column that you want to use has nulls, like
df_train['Column_name'].isnull().any()
If this returns True, you may go ahead and replace the nulls with default values, averages, zeros, or whatever method you prefer, as is commonly done in machine learning programs.
Example
df_train = df_train.dropna(subset=['Column_name'])              # drop rows where this column is null
df_train['Column_name'] = df_train['Column_name'].fillna('')    # for strings
df_train['Column_name'] = df_train['Column_name'].fillna(0)     # for int
df_train['Column_name'] = df_train['Column_name'].fillna(0.0)   # for float
Etc.
I hope this helps explain it.
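To illustrate the mechanism the question is actually about, here is a toy example (made-up data): indexing a DataFrame with a boolean Series keeps exactly the rows whose mask value is True, aligned by index.

import numpy as np
import pandas as pd

df_train = pd.DataFrame({'Age': [22.0, np.nan, 35.0, np.nan]})

mask = df_train['Age'].isnull()   # boolean Series: [False, True, False, True]
print(df_train[mask])             # only the rows where the mask is True (indices 1 and 3)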
I am quite new to Python coding, and I am dealing with a big dataframe for my internship.
I have an issue: sometimes there are wrong values in my dataframe. For example, I find string-type values ("broken leaf") instead of integer-type values such as ("120 cm") or (NaN).
I know there is the df.replace() function, but to use it you need to know that there are wrong values. So how do I find out whether there are any wrong values inside my dataframe?
Thank you in advance
"120 cm" is a string, not an integer, so that's a confusing example. Some ways to find "unexpected" values include:
Use "describe" to examine the range of numerical values, to see if there are any far outside of your expected range.
Use "unique" to see the set of all values for cases where you expect a small number of permitted values, like a gender field.
Look at the datatypes of columns to see whether there are strings creeping into fields that are supposed to be numerical.
Use regexps if valid values for a particular column follow a predictable pattern.
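A minimal sketch of these checks, with made-up data and an assumed column name:

import pandas as pd

# Toy data with a stray string in a column that should be numeric
df = pd.DataFrame({'height_cm': [120, 95, 'broken leaf', 130]})

print(df.dtypes)                 # 'object' instead of 'int64' hints at mixed types
print(df['height_cm'].unique())  # spot unexpected values directly

# Coerce to numbers: anything non-numeric becomes NaN and is easy to locate
numeric = pd.to_numeric(df['height_cm'], errors='coerce')
print(df[numeric.isnull()])      # the rows holding non-numeric values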
I plan to make a 'table' class that I can use throughout my data-analysis program to store gathered data in. The objective is to make simple tables like this:
ID   Mean size   Stdv    Date measured   Relative flatness
-----------------------------------------------------------
1    133.4242    34.43   Oct 20, 2013    32093
2    239.244     34.43   Oct 21, 2012    3434
I will follow the sqlite3 suggestion from this post: python-data-structure-for-maintaing-tabular-data-in-memory, but I will still need to save it as a csv file (not as a database), and I want it to eat my data as we go: add columns on the fly whenever new measures become available and are deemed to be interesting. For that, the class will need to be able to determine the data type of the data thrown at it.
Sqlite3 has limited datatypes: float, int, date and string. Python and numpy together have many types. Is there an easy way to quickly decide what the datatype of a variable is, so my table class can automatically add a column when new data containing new fields is entered?
I am not too concerned about performance, the table should be fairly small.
I want to use my class like so:
dt = Table()
dt.add_record({'ID': 5, 'Mean size': 39.4334})
dt.add_record({'ID':5, 'Goodness of fit': 12})
In the last line, there is new data: the Table class needs to figure out what kind of data it is and then add a column to the sqlite3 table. Making it all strings seems a bit too sloppy; I still want to keep my high-precision floats correct...
Also: If something like this already exists, I'd like to know about it.
It seems that your question is: "Is there an easy way to quickly decide what the datatype of a variable is?". This is a simple question, and the answer is:
type(variable).
But the context you provide requires a more careful answer.
Since SQLite3 provides only a few data types (slightly different ones than what you said), you need to map your input variables to the types provided by SQLite3.
But you may encounter further problems: You may need to change the types of columns as you receive new records, if you do not want to require that the column type be fixed in advance.
For example, for the Goodness of fit column in your example, you get an int (12) first. But you may get a float (e.g. 10.1) the second time, which shows that both values must be interpreted as floats. And if next time you receive a string, then all of them must be strings, right? But then the exact formatting of the numbers also counts: whereas 12 and 12.0 are the same when you interpret them as floats, they are not when you interpret them as strings; and the first value may become "12.0" when you convert all of them to strings.
So either you throw an exception when the types of consecutive values for the same column do not match, or you try to convert the previous values according to the new ones; but occasionally you may need to re-read the input.
Nevertheless, once you make those decisions regarding the expected behavior, it should not be a very difficult problem to implement.
Regarding your last question: I personally do not know of an existing implementation to this problem.
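A minimal sketch of the type-mapping and promotion idea (the names and the promotion order int -> float -> str are my assumptions, not an existing library):

# Map Python types to SQLite column types, with a promotion order so a column
# can widen (int -> float -> str) as new records arrive.
SQLITE_TYPE = {int: 'INTEGER', float: 'REAL', str: 'TEXT'}
PROMOTION = [int, float, str]

def infer_type(value):
    # Return the Python type the column should use for this value
    for t in (int, float, str):
        if isinstance(value, t):
            return t
    return str                      # fall back to TEXT for anything exotic

def promote(current, new):
    # Pick the wider of two types according to the promotion order
    return max(current, new, key=PROMOTION.index)

# Example: the 'Goodness of fit' column from the question
col_type = infer_type(12)                         # int
col_type = promote(col_type, infer_type(10.1))    # widens to float
print(SQLITE_TYPE[col_type])                      # REAL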