Pandas read_csv parsing large integers incorrectly - python

Hi, I am having a problem.
I am reading a file with various columns from CSV, and one of the columns is a 19-digit integer ID. The problem is that if I just read it with no options, the number is read as a float, and in that case it seems to be mixing up the numbers:
For example, the dataset has 100k unique ID values, but reading it like that gives me only 10k unique values. I changed the read_csv options to read it as a string, but the problem remains: it is still shown in scientific notation (e.g. 1.07e+18).
import pandas as pd

pd.set_option('display.float_format', lambda x: '%.0f' % x)
df = pd.read_csv(file)

As asked: yes, this really happens when you read big integer values from a .csv via pd.read_csv. For example:
import numpy as np
import pandas as pd

df = pd.read_csv('/home/user/data.csv', dtype=dict(col_a=str, col_b=np.int64))
# where both col_a and col_b contain the same value: 107870610895524558
After reading, the following conditions are True:
df.col_a == '107870610895524558'
df.col_a.astype(int) == 107870610895524558
# BUT
df.col_b == 107870610895524560
Thus I suggest that when reading big integers, you read them as strings and then convert the column type to int.
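A minimal sketch of that workaround, reusing the placeholder path and column name from above: read the big-ID column as a string so the parser never round-trips through float, then cast in memory.

import pandas as pd

df = pd.read_csv('/home/user/data.csv', dtype={'col_b': str})
df['col_b'] = df['col_b'].astype('int64')  # exact conversion, no float involved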

Related

Pandas: copying values from column UID to either column8 or column9, depending on the length of the value

I have a data frame in which one of the columns (UID) holds either 7-digit or 10-digit numbers.
I have written regexes to identify 7 or 10 digits (thanks to a very similar question on Stack Overflow). These seem to work well on a text file.
no_7 = re.compile('(?<![0-9])[0-9]{7}(?![0-9])')
no_10 = re.compile('(?<![0-9])[0-9]{10}(?![0-9])')
Again, thanks to stackoverflow, I have written the following.
Again, thanks to Stack Overflow, I have written the following.
If the value has 7 digits, it is copied to the second-to-last column:
df['column8'] = df['UID'].apply(lambda x: x if (x == re.findall(no_7, x)) else 'NaN')
If the value has 10 digits, it is copied to the last column:
df['column9'] = df['UID'].apply(lambda x: x if (x == re.findall(no_10, x)) else 'NaN')
While debugging the problem, I found that the regex is never able to read the numeric column.
The regex complains:
TypeError: expected string or bytes-like object
I have tried converting column "UID" with pd.to_numeric.
I have tried converting column "UID" with df["UID"].astype(int).
I have tried converting column "UID" with df["UID"].apply(np.int64).
All of this assumed the problem was that the column is incorrectly formatted, which I no longer think is the case.
You are evidently using the int type in your column and need str to apply string operations. You can convert using:
df['UID'] = df['UID'].astype(str)
However, there are probably much better ways to do what you want; please improve your question as requested for a better response.
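For completeness, a hedged sketch of the whole flow under the assumptions above (UID was parsed as integers, targets are 7 or 10 digits). Using str.fullmatch (pandas 1.1+) also sidesteps comparing a string against the list that re.findall returns:

import pandas as pd

df = pd.DataFrame({'UID': [1234567, 1234567890]})  # made-up sample data
uid = df['UID'].astype(str)
df['column8'] = uid.where(uid.str.fullmatch(r'\d{7}'), 'NaN')   # keep 7-digit IDs
df['column9'] = uid.where(uid.str.fullmatch(r'\d{10}'), 'NaN')  # keep 10-digit IDs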

Processing Data by Datatype in Pandas

I have a DataFrame containing int and str data which I have to process.
I would like to separate the text and the numerical values in each cell into separate columns, so that I can compute on the numerical data.
My columns mix digits and trailing text in each cell, similar to the example in the answer below.
I have read about doing something like this through the apply and applymap functions, but I can't design such a function as I am new to pandas. It should basically do -
def separator(cell):
    if cell has str:
        add the str part to another column (Check column), leave the int in place
    else:
        add 'NA' to the Check column
You can do this using str.extract followed by to_numeric:
import pandas as pd
df = pd.DataFrame({'a_mrk4': ['042FP', '077', '079', '1234A-BC D..EF']})
df[['a_mrk4', 'Check']] = df['a_mrk4'].str.extract(r'(\d+)(.*)')
df['a_mrk4'] = pd.to_numeric(df['a_mrk4'])
print(df)
Output:
   a_mrk4       Check
0      42          FP
1      77
2      79
3    1234  A-BC D..EF
You can use regular expressions. Let's say you have a column (target_col) and the data follow the pattern digits+text; then you can use the following:
df.target_col.str.extractall(r'(\d+)(\w+)')
You can tweak the regex to match your specific needs.
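A small illustration of the extractall route on made-up data (target_col is a placeholder name); each row yields one match, with the digit and text groups as columns 0 and 1:

import pandas as pd

df = pd.DataFrame({'target_col': ['042FP', '1234ABC']})
print(df['target_col'].str.extractall(r'(\d+)(\w+)'))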

Pandas adding decimal points when using read_csv

I'm working with some CSV files and using pandas to turn them into a DataFrame. After that, I use an input to find values to delete.
I'm hung up on one small issue: for some columns it's adding ".0" to the values in the column. It only does this in columns with numbers, so I'm guessing it's reading the column as a float. How do I prevent this from happening?
The part that really confuses me is that it only happens in a few columns, so I can't quite figure out a pattern. I need to chop off the ".0" so I can re-import it, and I feel like it would be easiest to prevent it from happening in the first place.
Thanks!
Here's a sample of my code:
clientid = int(input('What client ID needs to be deleted?'))
df1 = pd.read_csv('Client.csv')
clientclean = df1.loc[df1['PersonalID'] != clientid]
clientclean.to_csv('Client.csv', index=None)
Ideally, I'd like all of the values to be the same as the original csv file, but without the rows with the clientid from the user input.
If PersonalID is the header of the problematic column, try this:
import numpy as np
df1 = pd.read_csv('Client.csv', dtype={'PersonalID': np.int32})
Edit:
Since an integer column cannot hold NaN values, fill those first and then cast.
You can try this on each problematic column:
df1[col] = df1[col].fillna(-9999) # or 0 or any value you want here
df1[col] = df1[col].astype(int)
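As a hedged alternative on newer pandas versions (0.24+), the nullable 'Int64' extension dtype stores integers while still allowing missing values, so no sentinel fill is needed:

df1[col] = df1[col].astype('Int64')  # capital I: pandas nullable integer dtype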
You could go through each value and, if it is a number x, subtract int(x) from it; if this difference is 0.0 for every number in a column, convert that column to int. Or, if you're not dealing with any non-integers, you could just convert all values that are numbers to ints.
For an example of the latter (when your original data does not contain any non-integer numbers):
for index, row in df1.iterrows():
    for c, x in enumerate(row):
        if isinstance(x, float):
            df1.iloc[index, c] = int(x)
For an example of the former (if you want to keep non-integer numbers as non-integers, but want to guarantee that integer-valued numbers stay as integers):
import sys

for c, col in enumerate(df1.columns):
    foundNonInt = False
    for r, index in enumerate(df1.index):
        x = df1.iloc[r, c]
        if isinstance(x, float):
            if abs(x - int(x)) > sys.float_info.epsilon:
                foundNonInt = True
                break
    if not foundNonInt:
        df1.iloc[:, c] = df1.iloc[:, c].astype(int)
Note that the above method is not fool-proof: if, by chance, a non-integer column in the original data happens to contain only values of the form x.0000000, whole to the last decimal place, it will be converted to int.
It was a datatype issue.
ALollz's comment led me in the right direction: pandas was inferring a dtype of float, which added the decimal points.
I specified the dtype as object (from Akarius's comment) when using read_csv, which resolved the issue.
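A minimal sketch of the fix the asker describes, applied to the code from the question. With dtype=object every value stays as text, so the comparison must also be string-to-string and the int() cast on the input is dropped:

import pandas as pd

clientid = input('What client ID needs to be deleted?')
df1 = pd.read_csv('Client.csv', dtype=object)  # keep every column verbatim as text
clientclean = df1.loc[df1['PersonalID'] != clientid]
clientclean.to_csv('Client.csv', index=None)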

Python dataframe how to group by one column and get sum of other column

I want to create a new data frame with 2 columns: 'Striker_Id', and a second column holding the sum of 'Batsman_Scored' for each grouped 'Striker_Id'.
Eg:
Striker_ID Batsman_Scored
1 0
2 8
...
I tried ball.groupby(['Striker_Id'])['Batsman_Scored'].sum(), but this is what I get:
Striker_Id
1 0000040141000010111000001000020000004001010001...
2 0000000446404106064011111011100012106110621402...
3 0000121111114060001000101001011010010001041011...
4 0114110102100100011010000000006010011001111101...
5 0140016010010040000101111100101000111410011000...
6 1100100000104141011141001004001211200001110111...
It doesn't sum; it just concatenates all the numbers. What's the alternative?
For some reason, your columns were loaded as strings. While loading them from a CSV, try applying a converter -
df = pd.read_csv('file.csv', converters={'Batsman_Scored' : int})
Or,
df = pd.read_csv('file.csv', converters={'Batsman_Scored' : pd.to_numeric})
If that doesn't work, then convert to integer after loading -
df['Batsman_Scored'] = df['Batsman_Scored'].astype(int)
Or,
df['Batsman_Scored'] = pd.to_numeric(df['Batsman_Scored'], errors='coerce')
Now, performing the groupby should work -
r = df.groupby('Striker_Id')['Batsman_Scored'].sum()
Without access to your data, I can only speculate. But it seems that, at some point, your data contains non-numeric values that prevent pandas from performing conversions, resulting in those columns being retained as strings. It's a little difficult to pinpoint the problematic data until you actually load it and run something like
(~df.col.str.isdigit()).any()
That will tell you whether there are any non-numeric items. Note that this only works for integers; float columns cannot be debugged like this.
Also, another way of seeing what columns have corrupt data would be to query dtypes -
df.dtypes
Which will give you a listing of all columns and their datatypes. Use this to figure out what columns need parsing -
for c in df.columns[df.dtypes == object]:
    print(c)
You can then apply the methods outlined above to fix them.
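For instance, a compact end-to-end run of the suggested fix on made-up data shaped like the question's:

import pandas as pd

ball = pd.DataFrame({'Striker_Id': [1, 1, 2, 2],
                     'Batsman_Scored': ['0', '4', '1', '6']})  # loaded as strings
ball['Batsman_Scored'] = pd.to_numeric(ball['Batsman_Scored'], errors='coerce')
print(ball.groupby('Striker_Id')['Batsman_Scored'].sum())
# Striker_Id
# 1    4
# 2    7
# Name: Batsman_Scored, dtype: int64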

Python: How do converters work in the genfromtxt() function?

I am new to Python, and I have the following example that I don't understand.
The following is a csv file with some data
%%writefile wood.csv
item,material,number
100,oak,33
110,maple,14
120,oak,7
145,birch,3
Then, the example defines a function to convert the tree names above to integers.
tree_to_int = dict(oak=1,
                   maple=2,
                   birch=3)

def convert(s):
    return tree_to_int.get(s, 0)
The first question is: why is there a "0" after "s"? I removed that "0" and got the same result.
The last step is to read the data with np.genfromtxt:
data = np.genfromtxt('wood.csv',
                     delimiter=',',
                     dtype=int,
                     names=True,
                     converters={1: convert})
I was wondering, for the converters argument, what does {1: convert} mean exactly? In particular, what does the number 1 mean in this case?
For the first question: the 0 is the default value that dict.get returns when the key s is not found in tree_to_int. With this particular file every material is in the dictionary, so removing it changes nothing, but it guards against unknown names.
For the second question, according to the documentation (https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html), converters is a dictionary whose keys are column numbers (where the first column is column 0) and whose values are functions that convert the entries in that column.
So in this code, the 1 indicates column one of the csv file, the one with the names of the trees. Including this argument causes numpy to use the convert function to replace the tree names with their corresponding numbers in data.
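A hedged, self-contained rerun of the example. Note that on Python 3, genfromtxt may hand the converter bytes rather than str, so this sketch decodes first; the expected output is shown as a comment:

import numpy as np

tree_to_int = dict(oak=1, maple=2, birch=3)

def convert(s):
    # genfromtxt may pass bytes on Python 3; decode before the dict lookup
    if isinstance(s, bytes):
        s = s.decode()
    return tree_to_int.get(s, 0)  # 0 is the fallback for unknown names

data = np.genfromtxt('wood.csv', delimiter=',', dtype=int,
                     names=True, converters={1: convert})
print(data)
# e.g. [(100, 1, 33) (110, 2, 14) (120, 1, 7) (145, 3, 3)]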
