I'm trying to check whether all of my expected values are in a pandas DataFrame column. The expected values are known ahead of time, and the DataFrame is generated automatically from a database query.
This is an example of what I'm trying to do:
import io
import pandas as pd
expected_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
csv = io.StringIO("""ExpectedID,Random Value
1,val1
2,val2
3,val3
8,val8
9,val9
10,val10
""")
df = pd.read_csv(csv, sep=",")
for e in expected_ids:
    if e not in df['ExpectedID']:
        print("Missing:", e)
My problem is that I have to check each expected value individually, and in my real code there are approximately 14,000 of them. I'd also like to pull the missing ones into another DataFrame that I can manipulate later, but I don't know how to do that.
The other problem I have is that the above prints this:
Missing: 6
Missing: 7
Missing: 8
Missing: 9
Missing: 10
Those values aren't all correct. I am missing 6 and 7, but 8, 9, and 10 are in the df. It also doesn't say that 4 and 5 are missing.
How can I accurately check if multiple values are in a dataframe column?
df['ExpectedID'] is a Series, and membership tests on a Series behave like they do on a dict: in checks the index (the keys), not the values:
In [5]: df.ExpectedID
Out[5]:
0 1
1 2
2 3
3 8
4 9
5 10
Name: ExpectedID, dtype: int64
In [6]: 0 in df['ExpectedID']
Out[6]: True
You should test for membership in df['ExpectedID'].values instead.
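To do all ~14,000 checks at once, and to pull the missing values into their own DataFrame, you can use Series.isin; this is a minimal sketch building on the question's expected_ids and df (the names expected and missing_df are just illustrative):
import pandas as pd
expected = pd.Series(expected_ids, name='ExpectedID')
# True wherever an expected id is absent from the column
mask = ~expected.isin(df['ExpectedID'])
# Pull the missing ids into their own DataFrame for later manipulation
missing_df = expected[mask].to_frame()
print(missing_df)  # rows for 4, 5, 6 and 7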
I have a problem with how to code this condition appropriately. I'm creating a new column in my DataFrame, new_column, which subtracts a value from the test column based on the row's index. I'm currently using this code to subtract a different value on every fourth row:
import numpy as np
import pandas as pd
subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = np.where(data.index % 4,
                              data['test'] - subtraction_value,
                              data['test'] - subtraction_value_2)
print(data['new_column'])
[6,1,2,1,-5,0,-1,2,4,6]
However, I now wish to perform the higher subtraction on the first two positions in the column, then 3 subtractions with the original value, then another two with the higher subtraction value, then 3 small subtractions, and so forth. I thought I could do it this way, with an | condition in my np.where statement:
data['new_column'] = np.where((data.index % 4) | (data.index % 5),
                              data['test'] - subtraction_value,
                              data['test'] - subtraction_value_2)
However, this didn't work, and I feel my maths may be slightly off. My desired output would look like this:
print(data['new_column'])
[6,-2,2,1,-2,-3,-4,3,7,6]
As you can see, this slightly shifts the pattern. Can I still use numpy.where() here, or do I have to take a new approach? Any help would be greatly appreciated!
As mentioned in the comment section, the output should equal
[6,-2,2,1,-2,-3,-4,2,7,6] instead of [6,-2,2,1,-2,-3,-4,3,7,6] according to your own logic (position 7 is 5 - 3 = 2). Given that, you can do the following:
import pandas as pd
import numpy as np
from itertools import chain
subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
index_pos_large_subtraction = list(chain.from_iterable((data.index[i], data.index[i+1]) for i in range(0, len(data)-1, 5)))
data['new_column'] = np.where(~data.index.isin(index_pos_large_subtraction), data['test']-subtraction_value, data['test']-subtraction_value_2)
# The next line is equivalent to the previous one
# data['new_column'] = np.where(data.index.isin(index_pos_large_subtraction), data['test']-subtraction_value_2, data['test']-subtraction_value)
---------------------------------------------
test new_column
0 12 6
1 4 -2
2 5 2
3 4 1
4 1 -2
5 3 -3
6 2 -4
7 5 2
8 10 7
9 9 6
---------------------------------------------
As you can see, np.where works fine. Your masking condition is the problem: (data.index % 4) | (data.index % 5) is falsy only when the index is a multiple of both 4 and 5, so only row 0 would get the larger subtraction; you are not selecting rows according to your logic.
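If the pattern really does repeat every five rows (two large subtractions, then three small ones), a shorter equivalent of the isin approach above is a single modulo test; a sketch of the same logic:
import numpy as np
import pandas as pd
subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
# The first two rows of every block of five get the larger subtraction
data['new_column'] = np.where(data.index % 5 < 2,
                              data['test'] - subtraction_value_2,
                              data['test'] - subtraction_value)
# -> [6, -2, 2, 1, -2, -3, -4, 2, 7, 6]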
I have a CSV file and I want to sort the first column from lowest to greatest.
The first column's name is "CRIM".
I can read the first column, but I can't sort it; the numbers are floats.
Also, I would like to find the median of the list.
This is what I did so far:
import csv
with open('data.csv', newline='') as csvfile:
    data = csv.DictReader(csvfile)
    for line in data:
        print(line['CRIM'])
I would advise using pandas and DataFrame.median().
Example data:
A B C D
0 12 5 20 14
1 4 2 16 3
2 5 54 7 17
3 44 3 3 2
4 1 2 8 6
# importing pandas as pd
import pandas as pd
# for your csv
# df = pd.read_csv('data.csv')
# Creating the dataframe (example)
df = pd.DataFrame({"A":[12, 4, 5, 44, 1],
"B":[5, 2, 54, 3, 2],
"C":[20, 16, 7, 3, 8],
"D":[14, 3, 17, 2, 6]})
# Find the median. Even if we do not specify axis=0, the method
# will return the median over the index axis by default
df.median(axis=0)
A 5.0
B 3.0
C 8.0
D 6.0
dtype: float64
df['A'].median(axis = 0)
5.0
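The sorting half of the question can also be done in pandas; a minimal sketch on the same example frame using sort_values:
# Sort a single column, lowest to greatest
df['A'].sort_values()
# Or sort the whole frame by that column
df_sorted = df.sort_values('A')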
https://www.programiz.com/python-programming/methods/built-in/sorted
Use sorted() after collecting the column as floats (inside the loop, line['CRIM'] is a single string, so gather the whole column first):
crim = [float(line['CRIM']) for line in data]
CRIM_sorted = sorted(crim)
For the median, you can use a package or just build your own:
Finding median of list in Python
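For the median without any third-party package, the standard library's statistics.median works directly on the crim list gathered above; a minimal sketch:
import statistics
# Median of the CRIM column, using the float list built earlier
print(statistics.median(crim))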
Let's say that I have some data from a file where some columns are "of the same kind", differing only in the subscript of some mathematical variable, say x:
n A B C x[0] x[1] x[2]
0 1 2 3 4 5 6
1 2 3 4 5 6 7
Is there some way I can load this into a pandas dataframe df and somehow treat the three x-columns as an indexable, array-like entity (I'm new to pandas)? I believe it would be convenient, because I could do operations on the data-series contained in x such as sum(df.x).
Kind regards.
EDIT:
Admittedly, my original post was not clear enough. I'm not just interested in getting the sum of three columns. That was just an example. I'm looking for a generally applicable abstraction that I hope is built into pandas.
I'd like to have multiple columns accessible through (sub-)indices of one entity, e.g. df.x[0], so that I (or any other user of the data) can apply whatever operation they want (sum/max/min/avg/standard deviation, you name it). You can consider the x's an ensemble of time-dependent measurements if you like.
Kind regards.
Consider that you define your dataframe like this:
df = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                   [2, 3, 4, 5, 6, 7]],
                  columns=['A', 'B', 'C', 'x0', 'x1', 'x2'])
Then with
x = ['x0', 'x1', 'x2']
you can use the following notation, which allows a quite general definition of x:
>>> df[x].sum(axis=1)
0 15
1 18
dtype: int64
Look for the columns whose names start with 'x' and perform the operations you need:
column_num=[col for col in df.columns if col.startswith('x')]
df[column_num].sum(axis=1)
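Equivalently, DataFrame.filter can do the name matching in one step; a small variation on the answer above:
# Select every column whose name starts with 'x', then sum across rows
df.filter(regex='^x').sum(axis=1)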
I'll give you another answer, which departs from your initial data structure in exchange for letting you address the values as df.x[0] etc.
Consider that you have defined your dataframe like this (with numpy imported as np):
>>> dv = pd.DataFrame(np.random.randint(10, size=20),
index=pd.MultiIndex.from_product([range(4), range(5)]), columns=['x'])
>>> dv
x
0 0 8
1 3
2 4
3 6
4 1
1 0 8
1 9
2 1
3 8
4 8
[...]
Then you can do exactly this:
dv.x[1]
0 8
1 9
2 1
3 8
4 8
Name: x, dtype: int64
which is your desired notation. It requires some changes to your initial set-up but gives you exactly what you want.
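A third route, not covered above, keeps your original wide layout and instead turns the column names into a two-level header, so that df['x'] selects all the x-columns at once. A sketch, assuming the columns are literally named 'x[0]', 'x[1]', 'x[2]' (split_name is just an illustrative helper):
import pandas as pd
df = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                   [2, 3, 4, 5, 6, 7]],
                  columns=['A', 'B', 'C', 'x[0]', 'x[1]', 'x[2]'])
# Split 'x[0]' into ('x', 0); plain names like 'A' become ('A', '')
def split_name(name):
    if '[' in name:
        base, sub = name.rstrip(']').split('[')
        return (base, int(sub))
    return (name, '')
df.columns = pd.MultiIndex.from_tuples([split_name(c) for c in df.columns])
df['x'].sum(axis=1)  # 0    15
                     # 1    18
df['x'][0]           # the x[0] column alone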
Here is my example:
import pandas as pd
df = pd.DataFrame({'col_1':[1,5,6,77,9],'col_2':[6,2,4,2,5]})
df.index = [8,9,10,11,12]
This subsetting is by row position:
df.col_1[2:5]
returns
10 6
11 77
12 9
Name: col_1, dtype: int64
while this subsetting is by index label and does not work:
df.col_1[2]
returns:
KeyError: 2
I find it very confusing and am curious what the reason behind it is.
Your statements are ambiguous, therefore it is best to define explicitly what you want.
df.col_1[2:5] works like df.col_1.iloc[2:5], using integer location.
Whereas df.col_1[2] works like df.col_1.loc[2], using index-label location; since there is no index labelled 2, you get the KeyError.
Hence it is best to state whether you are using integer location, with .iloc, or index-label location, with .loc.
See Pandas Indexing docs.
Let's assume this is the initial DataFrame:
df = pd.DataFrame(
    {
        'col_1': [1, 5, 6, 77, 9],
        'col_2': [6, 2, 4, 2, 5]
    },
    index=list('abcde')
)
df
Out:
col_1 col_2
a 1 6
b 5 2
c 6 4
d 77 2
e 9 5
The index consists of strings so it is generally obvious what you are trying to do:
df['col_1']['b'] You passed a string so you are probably trying to access by label. It returns 5.
df['col_1'][1] You passed an integer so you are probably trying to access by position. It returns 5.
Same deal with slices: df['col_1']['b':'d'] uses labels and df['col_1'][1:4] uses positions.
When the index is also integer, nothing is obvious anymore.
df = pd.DataFrame(
    {
        'col_1': [1, 5, 6, 77, 9],
        'col_2': [6, 2, 4, 2, 5]
    },
    index=[8, 9, 10, 11, 12]
)
df
Out:
col_1 col_2
8 1 6
9 5 2
10 6 4
11 77 2
12 9 5
Let's say you type df['col_1'][8]. Are you trying to access by label or by position? What if it was a slice? Nobody knows. At this point, pandas has to choose one of them based on likely usage. It is, in the end, a Series, and what distinguishes a Series from an array is its labels, so the choice for df['col_1'][8] is labels. Slicing with labels is not that common, so pandas is being smart here and uses positions when you pass a slice. Is it inconsistent? Yes. Should you avoid it? Yes. This is the main reason ix was deprecated.
Explicit is better than implicit, so use either iloc or loc when there is room for ambiguity. loc will raise a KeyError if you try to access a missing label, and iloc will reject non-integer keys outright (and raise an IndexError for out-of-range positions).
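A short illustration on the integer-indexed frame from the question, spelling out the ambiguous calls explicitly:
import pandas as pd
df = pd.DataFrame({'col_1': [1, 5, 6, 77, 9],
                   'col_2': [6, 2, 4, 2, 5]},
                  index=[8, 9, 10, 11, 12])
df['col_1'].loc[8]      # 1 -- by label
df['col_1'].iloc[0]     # 1 -- by position (same row)
df['col_1'].iloc[2:5]   # positions 2..4, end-exclusive: 6, 77, 9
df['col_1'].loc[10:12]  # labels 10..12, end-INclusive: 6, 77, 9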
When storing data in a json object with to_json, and reading it back with read_json, rows and columns are returned sorted alphabetically. Is there a way to keep the results ordered or reorder them upon retrieval?
You could use orient='split', which stores the index and column information in lists, which preserve order:
In [34]: df
Out[34]:
A C B
5 0 1 2
4 3 4 5
3 6 7 8
In [35]: df.to_json(orient='split')
Out[35]: '{"columns":["A","C","B"],"index":[5,4,3],"data":[[0,1,2],[3,4,5],[6,7,8]]}'
In [36]: pd.read_json(df.to_json(orient='split'), orient='split')
Out[36]:
A C B
5 0 1 2
4 3 4 5
3 6 7 8
Just remember to use orient='split' on reading as well, or you'll get
In [37]: pd.read_json(df.to_json(orient='split'))
Out[37]:
columns data index
0 A [0, 1, 2] 5
1 C [3, 4, 5] 4
2 B [6, 7, 8] 3
If you want to produce the orient='records' format and keep the column order, you can build it yourself with a function like the one below. I don't think it is a wise approach and do not recommend it, because JSON does not guarantee the order of object keys.
import json
def df_to_json(df):
    res_arr = []
    ldf = df.copy()
    ldf = ldf.fillna('')
    lcolumns = [ldf.index.name] + list(ldf.columns)
    for key, value in ldf.iterrows():
        lvalues = [key] + list(value)
        res_arr.append(dict(zip(lcolumns, lvalues)))
    return json.dumps(res_arr)
In addition, for reading without the columns being sorted, see this related question: Python json.loads changes the order of the object.
Good luck!
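If you are stuck with a default to_json/read_json round trip, one workaround is to remember the column order yourself and reindex after reading; a sketch (original_columns and restored are just illustrative names):
import pandas as pd
df = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=['A', 'C', 'B'], index=[5, 4])
original_columns = list(df.columns)         # record the order before serialising
round_tripped = pd.read_json(df.to_json())  # columns may come back sorted
restored = round_tripped[original_columns]  # restore the original order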
Let's say you have a pandas DataFrame that you read from JSON:
import pandas as pd
df = pd.read_json('/abc.json')
df.head()
There are two ways to save it back to JSON with to_json. The first uses orient='split', which stores columns, index, and data in separate lists:
df.sample(200).to_json('abc_sample.json', orient='split')
However, to preserve a record-per-row layout like in a CSV, use orient='records':
df.sample(200).to_json('abc_sample_2nd.json', orient='records')
Each row then becomes its own JSON object, with the keys in the original column order.