Weird lines in python file, can't extract column - python

0 0: 1
0 1: 1
0 1: 0
1 0: 0
I have a file that looks like the sample above.
I am trying to extract it by columns into arrays using numpy.loadtxt. Ideally, I want several arrays, or at least a data structure in which the arrays are [0,0,0,1], [0,1,1,0]. Because there is a colon after the second number, I'm unable to use numpy.loadtxt directly. Does anyone have a solution for either working around that, or simply removing the colon without having to split the file by hand?

np.loadtxt(file, converters = {1: lambda s: int(s.strip(":"))})
From numpy.loadtxt:
converters : dict, optional
A dictionary mapping column number to a function that will convert that column to a float. E.g., if column 0 is a date string: converters = {0: datestr2num}. Converters can also be used to provide a default value for missing data (but see also genfromtxt): converters = {3: lambda s: float(s.strip() or 0)}. Default: None.
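Depending on the NumPy version, a converter may receive bytes rather than str, in which case s.strip(":") would need adjusting. An alternative that avoids converters altogether is to strip the colon from each line before np.loadtxt parses it. The following is a minimal sketch, assuming the data matches the sample in the question; the `lines` list is a stand-in for an open file handle:

```python
import numpy as np

# Stand-in for the file contents; with a real file, iterate over open(path) instead.
lines = ["0 0: 1", "0 1: 1", "0 1: 0", "1 0: 0"]

# Remove the colon lazily, line by line; loadtxt accepts any iterable of lines.
cleaned = (line.replace(":", "") for line in lines)

# unpack=True transposes the result so that each column comes back as its own array.
col0, col1, col2 = np.loadtxt(cleaned, dtype=int, unpack=True)

print(col0)  # first column
print(col1)  # second column (the one that carried the colon)
```

With a real file, the generator can wrap the file object directly, so the file never has to be rewritten on disk.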

Related

Pandas - Choose several floats from the same string in Pandas to operate with them

I have a dataframe extracted with Pandas in which one of the columns looks something like this:
What I want to do is to extract the numerical values (floats) in this column, which by itself I could do. The issue comes because I have some cells, like the cell 20 in the image, in which I have more than one number, so I would like to make an average of these values. I think that for that I would first need to recognize the different groups of numerical values in the string (each float number) and then extract them as floats to then operate with them. I don't know how to do this.
Edit: I have found a solution to this using the re.findall command from the re module. It is based on the answer to a question in this thread: Find all floats or ints in a given string.
for index, value in z.items():  # Series.iteritems() was removed in pandas 2.0
    z[index] = statistics.mean([float(h) for h in re.findall(r'(?:\b\d{1,2}\b(?:\.\d*))', value)])
Note that I haven't included a match for integers, and I only account for values up to 99, simply because of the type of data that I have.
However, I get a warning with this approach, due to the loop (there is no warning when I do it only for one element of the series):
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
Although I don't see any issue happening with my data, is this warning important?
I think you can benefit from the Pandas vectorized operations here. Use findall over the original dataframe and apply in sequence the pd.Series to transform from list to columns and pd.to_numeric to convert from string to numeric type (default return dtype is float64). Then calculate the average of the values on each row with .mean(axis=1).
import pandas as pd

d = {0: {0: '2.469 (VLT: emission host)',
         1: '1.942 (VLT: absorption)',
         2: '1.1715 (VLT: absorption)',
         3: '0.42 (NOT: absorption)|0.4245 (GTC)|0.4250 (ESO-VLT UT2: absorption & emission)',
         4: '3.3765 (VLT: absorption)',
         5: '1.86 (Xinglong: absorption)| 1.86 (GMG: absorption)|1.859 (VLT: absorption)',
         6: '<2.4 (NOT: inferred)'}}
df = pd.DataFrame(d)
print(df)
s_mean = df[0].str.findall(r'(?:\b\d{1,2}\b(?:\.\d*))')\
    .apply(pd.Series)\
    .apply(pd.to_numeric)\
    .mean(axis=1)
print(s_mean)
Output from s_mean
0 2.469000
1 1.942000
2 1.171500
3 0.423167
4 3.376500
5 1.859667
6 2.400000
I have found a solution based on what I wrote previously in the Edit of the original post:
It consists of using the re.findall() command with a regex, as posted in this thread: Find all floats or ints in a given string:
statistics.mean([float(h) for h in re.findall(r'(?:\b\d{1,2}\b(?:\.\d*))',string)])
Then, to loop over the dataframe column, just use the lambda x: method with the pandas apply command (df.apply). For this, I have defined a function (redshift_to_num) executing the operation above, and then apply this function to each element in the dataframe column:
import re
import pandas as pd
import statistics

def redshift_to_num(string):
    # Collect every float in the cell, then average them
    measures = [float(h) for h in re.findall(r'(?:\b\d{1,2}\b(?:\.\d*))', string)]
    mean = statistics.mean(measures)
    return mean

df.Redshift = df.Redshift.apply(lambda x: redshift_to_num(x))
Notes:
The data of interest in my case is stored in the dataframe column df.Redshift.
In the re.findall command I haven't included match for integers, and only account for values up to 99, just due to the type of data that I have.

How do I call a value from a list inside of a pandas dataframe?

I have some data that I put into a pandas dataframe. Inside cell [0,5] I have a list of times that I want to access and print out.
Dataframe:
GAME_A PROCESSING_SPEED
yellow_selected 19
red_selected 0
yellow_total 19
red_total 60
counters [0.849998, 1.066601, 0.883263, 0.91658, 0.96668]
Code:
import pandas as pd
df = pd.read_csv('data.csv', sep = '>')
print(df.iloc[0])
proc_speed = df.iat[0,5]
print(proc_speed[2])
When I try to print the 3rd time in the list I get ".". I tried to use a for loop to print the times, but instead I get the output below. How can I access specific values from the list? How would I print out the 3rd time, 0.883263?
[
0
.
8
4
9
9
9
8
,
1
.
0
6
6
...
This happens because with the way you are loading the data, the column 'PROCESSING_SPEED' is read as an object type, therefore, all elements of that series are considered strings (i.e., in this case proc_speed = "[0.849998, 1.066601, 0.883263, 0.91658, 0.96668]", which is exactly the string the loop is printing character by character).
Before printing the values you desire to display (from that cell), one should convert the string to a list of numbers, for example:
proc_speed = df.iat[4,1]
proc_speed = [float(s) for s in proc_speed[1:-1].split(',')]
for num in proc_speed:
    print(num)
Where proc_speed[1:-1].split(',') takes the string containing the list, except for the brackets at the beginning and end, and splits it according to the commas separating values.
In general, we have to be careful when loading columns with varying or ambiguous data types, as Pandas could have trouble parsing them correctly or in the way we want/expect it to be.
You can simply index proc_speed[index], provided the variable actually holds a list. Here is a working example; note that my call to df.iat uses different indexes:
d = {'GAME_A':['yellow_selected', 'red_selected', 'yellow_total', 'red_total', 'counters'],'PROCESSING_SPEED':[19,0,19,60,[0.849998, 1.066601, 0.883263, 0.91658, 0.96668]]}
df = pd.DataFrame(d)
proc_speed = df.iat[4, 1]
for i in proc_speed:
    print(i)
0.849998
1.066601
0.883263
0.91658
0.96668
proc_speed[1]
1.066601
proc_speed[3]
0.91658
You can convert with apply, it's easier than splitting, and converts your ints to ints:
pd.read_clipboard(sep=r"(?!\s+(?<=,\s))\s+")['PROCESSING_SPEED'].apply(eval)[4][2]
# 0.883263
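Since eval executes arbitrary code, a safer variant of the same idea is ast.literal_eval, which only parses Python literals. The sketch below assumes the column was loaded as strings, as described in the first answer; the dataframe is rebuilt by hand here rather than read from the original data.csv:

```python
import ast

import pandas as pd

# Hand-built stand-in for the loaded data: every cell in the column is a string.
df = pd.DataFrame({
    "GAME_A": ["yellow_selected", "red_selected", "yellow_total", "red_total", "counters"],
    "PROCESSING_SPEED": ["19", "0", "19", "60",
                         "[0.849998, 1.066601, 0.883263, 0.91658, 0.96668]"],
})

# literal_eval turns "19" into an int and the bracketed string into a list of floats.
parsed = df["PROCESSING_SPEED"].apply(ast.literal_eval)

third_time = parsed[4][2]
print(third_time)
```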

Pandas read from csv parsing incorrectly large integers

Hi, I am having a problem.
I am reading various columns from a csv file, and one of the columns is a 19-digit integer ID. The problem is that if I read it with no options, the number is read as a float, and in that case the values seem to get mixed up:
For example, the dataset has 100k unique ID values, but reading it like that gives me 10k unique values. I changed the read_csv options to read the column as a string and the problem remains, as the values still come out in scientific notation (e.g. *e^18).
pd.set_option('display.float_format', lambda x: '%.0f' % x)
df=pd.read_csv(file)
This really does happen when you read a big integer value from a .csv via pd.read_csv. For example:
import numpy as np
import pandas as pd

df = pd.read_csv('/home/user/data.csv', dtype=dict(col_a=str, col_b=np.int64))
# where both col_a and col_b contain same value: 107870610895524558
After reading following conditions are True:
df.col_a == '107870610895524558'
df.col_a.astype(int) == 107870610895524558
# BUT
df.col_b == 107870610895524560
So when reading big integers, I suggest reading them as strings and then converting the column type to int.
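The rounding in the example above is consistent with what a 64-bit float can represent: near 1e17 only every 16th integer is representable, so distinct IDs can collapse onto the same value once they pass through a float. A minimal sketch, assuming an in-memory CSV with a hypothetical `id` column in place of the real file:

```python
import io

import pandas as pd

big_id = 107870610895524558

# Passing through float64 snaps the value to the nearest representable number.
round_trip = int(float(big_id))
print(round_trip)

# Hypothetical two-row CSV whose IDs differ only in the last digit.
csv_text = "id,score\n107870610895524558,1.0\n107870610895524559,2.0\n"

# Forcing str for the ID column keeps every digit exactly as written in the file.
df = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})
ids = df["id"].tolist()
unique_count = df["id"].nunique()
print(ids, unique_count)
```

Python's built-in int has arbitrary precision, so converting the strings with int() afterwards loses nothing.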

Python: How does converters work in genfromtxt() function?

I am new to Python, I have a following example that I don't understand
The following is a csv file with some data
%%writefile wood.csv
item,material,number
100,oak,33
110,maple,14
120,oak,7
145,birch,3
Then, the example defines a function to convert the tree names above to integers.
tree_to_int = dict(oak=1,
                   maple=2,
                   birch=3)

def convert(s):
    return tree_to_int.get(s, 0)
The first question is why is there a "0" after "s"? I removed that "0" and get same result.
The last step is to read the data into a numpy array
data = np.genfromtxt('wood.csv',
                     delimiter=',',
                     dtype=int,  # the np.int alias was removed in NumPy 1.24
                     names=True,
                     converters={1: convert}
                     )
For the converters argument, what exactly does {1:convert} mean? In particular, what does the number 1 mean in this case?
For the second question, according to the documentation (https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html), {1:convert} is a dictionary whose keys are column numbers (where the first column is column 0) and whose values are functions that convert the entries in that column.
So in this code, the 1 indicates column one of the csv file, the one with the names of the trees. Including this argument causes numpy to use the convert function to replace the tree names with their corresponding numbers in data.
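As for the first question, the second argument to dict.get is the value returned when the key is not in the dictionary. Removing it makes no visible difference as long as every name in the file is a known key; for an unknown name, get would return None instead of 0. A small sketch using the same dictionary, with "pine" as a made-up name that is not in the file:

```python
tree_to_int = dict(oak=1, maple=2, birch=3)

def convert(s):
    # Fall back to 0 for any tree name that is not a key of tree_to_int
    return tree_to_int.get(s, 0)

known = convert("oak")                      # key exists, so its value is returned
unknown = convert("pine")                   # key is missing, so the default 0 is returned
without_default = tree_to_int.get("pine")   # no default given, so None is returned

print(known, unknown, without_default)
```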

How to replace all non-numbers while importing csv to python?

I want to import a dirty csv file into a numpy object. A small number of values are apparently not integers or floats, because the output does not have the correct dtype.
The code I use:
d.data = np.genfromtxt(inputtable, delimiter=";", skip_header=2, comments="#", dtype=float)  # the np.float alias was removed in NumPy 1.24
I would like to know if there is an easy way to just replace all non floats into -1 so that I do not need to find these values by hand in the 10.000+ rows.
You just have to provide a set of callbacks as the converters argument, as documented here:
converters : variable, optional
The set of functions that convert the data of a column to a value. The converters can also be used to provide a default value for missing
data: converters = {3: lambda s: float(s or 0)}.
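The documented lambda covers empty fields; for fields that contain arbitrary non-numeric text, the converter can catch the ValueError itself. The following is a rough sketch, assuming a three-column, semicolon-delimited input; the helper name `to_float_or_default` and the sample rows are made up for illustration:

```python
import io

import numpy as np

def to_float_or_default(token, default=-1.0):
    # Some NumPy versions hand converters bytes instead of str
    if isinstance(token, bytes):
        token = token.decode()
    try:
        return float(token)
    except ValueError:
        # Anything that is not a valid float becomes the default
        return default

# Made-up sample with two dirty cells ("abc" and "x")
sample = io.StringIO("1.0;2.5;3\n4;abc;6\n7;8;x\n")

n_columns = 3  # assumed known; one converter is registered per column
converters = {i: to_float_or_default for i in range(n_columns)}

data = np.genfromtxt(sample, delimiter=";", dtype=float, converters=converters)
print(data)
```

The same dictionary can be passed alongside the skip_header and comments arguments from the question.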
