Data presentation difference in python - python

Hopefully a fairly simple answer to my issue.
When I run the following code:
print (data_1.iloc[1])
I get a nice, vertical presentation of the data, with each column value header, and its value presented on separate rows. This is very useful when looking at 2 sets of data, and trying to find discrepancies.
However, when I write the code as:
print (data_1.loc[data_1["Name"].isin(["John"])])
I get all the information arrayed across the screen, with the column header in 1 row, and the values in another row.
My question is:
Is there any way of using the second code, and getting the same vertical presentation of the data?

The difference is that data_1.iloc[1] returns a pandas Series whereas data_1.loc[data_1["Name"].isin(["John"])] returns a DataFrame. Pandas has different representations for these two data types (i.e. they print differently).
The reason iloc[1] gives you a Series is that you indexed it using a scalar. If you do data_1.iloc[[1]] you'll see you get a DataFrame instead. Conversely, data_1["Name"].isin(["John"]) returns a boolean mask, and indexing with a mask always returns a DataFrame, even when only one row matches. If you want a Series instead, select a single row from the filtered result by position:
print(data_1.loc[data_1["Name"].isin(["John"])].iloc[0])
but only if you're sure you're getting exactly one row back (iloc[0] raises an IndexError on an empty result and silently ignores any further matches).
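To make the Series/DataFrame difference concrete, here is a minimal sketch. The contents of data_1 are made up for illustration; only the column name "Name" comes from the question. Transposing with .T is an alternative that keeps the vertical layout when several rows match.

```python
import pandas as pd

# Hypothetical stand-in for the asker's data_1.
data_1 = pd.DataFrame(
    {"Name": ["Anna", "John", "Mia"], "Age": [31, 45, 28], "City": ["Oslo", "Leeds", "Turin"]}
)

row_by_position = data_1.iloc[1]                      # scalar index -> Series
matches = data_1.loc[data_1["Name"].isin(["John"])]   # boolean mask -> DataFrame
print(type(row_by_position).__name__, type(matches).__name__)

# Option 1: take one row of the filtered frame, which gives a Series (vertical print).
first_match = matches.iloc[0]
print(first_match)

# Option 2: transpose the filtered frame; every match becomes its own column.
vertical = matches.T
print(vertical)
```

Option 2 is probably the safer choice when the filter may return more than one row, since nothing is dropped.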

Related

Is there a Python pandas function for retrieving a specific value of a dataframe based on its content?

I've got multiple excels and I need a specific value but in each excel, the cell with the value changes position slightly. However, this value is always preceded by a generic description of it which remains constant in all excels.
I was wondering if there was a way to ask Python to grab the value to the right of the element containing the string "xxx".
Try iterating over the Excel files (I guess you loaded each as a separate pandas object?),
something like for df in [dataframe1, dataframe2, ..., dataframeN].
Then you could pick the column you need (if the column stays constant), e.g. df['columnX'], and find which index it has:
df.index[df['columnX'] == "xxx"]. It may make sense to add .tolist() at the end, so that if "xxx" is a value that repeats more than once, you get all occurrences in a list.
The last step would be to take the index + 1 to get the value you want.
Hope it was helpful.
In general I would strongly suggest being more specific in your questions and providing code / examples.
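If the label can sit in any column, a variation on the idea above is to search the whole sheet and take the cell to its right. This is only a sketch: value_right_of is a hypothetical helper, and the two small frames stand in for sheets loaded with something like pd.read_excel(path, header=None).

```python
import numpy as np
import pandas as pd

def value_right_of(df, label):
    """Return the cell just to the right of the first cell whose text contains `label`."""
    # Boolean grid: True where the cell, viewed as text, contains the label.
    mask = df.astype(str).apply(lambda col: col.str.contains(label, regex=False)).to_numpy()
    rows, cols = np.nonzero(mask)
    if len(rows) == 0:
        return None                      # label not found in this sheet
    r, c = rows[0], cols[0]
    if c + 1 >= df.shape[1]:
        return None                      # label sits in the last column
    return df.iat[r, c + 1]

# Made-up sheets where the labelled cell moves around.
sheet_a = pd.DataFrame([["header", None, None], ["xxx total", 42, None]], dtype=object)
sheet_b = pd.DataFrame([[None, None, None], [None, "xxx total", 17]], dtype=object)

results = [value_right_of(sheet, "xxx") for sheet in (sheet_a, sheet_b)]
print(results)
```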

Pandas - value changes when adding a new column to a dataframe

I'm trying to add a new column to a dataframe using the following code.
labels = df1['labels']
df2['labels'] = labels
However, in the later part of my program, I found that there might be something wrong with the assignment. So, I checked it using
labels.equals(other=df2['labels'])
and I got a False. (I added this line instantly after assignment)
I also tried to:
print out part of labels and df2, and it turns out that there are indeed some lines that are different;
check the max and min values of both series, and they are different;
check the number of unique values in both series using len(set(labels)) and len(set(df2['labels'])), and they differ a lot;
test with a smaller amount of data, and this works totally fine.
My dataframe is rather large (40 million+ lines), so I cannot print them all out and check the values. Does anyone have any idea about what might lead to this kind of problem? Or is there any suggestions for further tests?
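One thing worth testing, offered as an assumption because the question does not show the indexes: assigning a Series to a DataFrame column aligns on index labels, not on position. If df1 and df2 have different or duplicated index labels, the assigned values get reordered, repeated, or replaced by NaN. The tiny frames below are made up to reproduce that effect.

```python
import pandas as pd

# Made-up data: df2's index has a duplicate (0) and a label missing from df1 (5).
df1 = pd.DataFrame({"labels": [10, 20, 30, 40]}, index=[0, 1, 2, 3])
df2 = pd.DataFrame({"x": ["a", "b", "c", "d"]}, index=[0, 0, 1, 5])

labels = df1["labels"]
df2["labels"] = labels                     # aligned on index labels, not row order

print(labels.equals(df2["labels"]))        # the check used in the question
print(df1.index.equals(df2.index))         # cheap test, even for very large frames
print(df2["labels"].isna().sum())          # labels with no counterpart become NaN

# If the rows are known to be in the same order, assign by position instead.
df2["labels_by_position"] = labels.to_numpy()
```

Comparing df1.index with df2.index, and checking df2.index.is_unique, does not require printing the data.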

BatchDataset not subscriptable when trying to format Python dictionary as table

I'm working through the TensorFlow Load pandas.DataFrame tutorial, and I'm trying to modify the output from a code snippet that creates the dictionary slices:
dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
for dict_slice in dict_slices.take(1):
    print(dict_slice)
I find the following output sloppy, and I want to put it into a more readable table format.
I tried to format the for loop, based on this recommendation
Which gave me the error that the BatchDataset was not subscriptable
Then I tried to use the range and len functions on dict_slices, so that i would be an integer index and not a slice
Which gave me the following error (as I understand, because the dict_slices is still an array, and each iteration is one vector of the array, not one index of the vector):
Refer here for the solution. To summarize, we need to use as_numpy_iterator:
example = list(dict_slices.as_numpy_iterator())
example[0]['age']
BatchDataset is a tf.data.Dataset instance that has been batched by calling its .batch(..) method. You cannot index a TensorFlow Dataset, and depending on the TensorFlow version you may not be able to call len on it either. I suggest iterating through it like you did in the first code snippet.
However, in your dataset you are using .to_dict('list'), which means that each key in your dictionary is mapped to a list as its value. Basically you have "columns" for every key and not rows; is this what you want? This would make printing line by line (as shown in the table printing example you linked) a lot more difficult, since you do not have different features in a row. It is also different from the example in the official TensorFlow code, where a datapoint consists of multiple features, and not one feature with multiple values.
Combining the Tensorflow code and pretty printing:
columns = list(df.columns.values)+['target']
dict_slices = tf.data.Dataset.from_tensor_slices((df.values, target.values)).batch(1) # batch = 1 because otherwise you will get multiple dict_slice - target pairs in one iteration below!
print(*columns, sep='\t')
for dict_slice, target in dict_slices.take(1):
    print(*dict_slice.numpy(), target.numpy(), sep='\t')
This needs a bit of formatting, because column widths are not equal.
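One way to even out the column widths, sketched in plain Python so it does not depend on TensorFlow: pad every cell to the width of the widest entry in its column. format_table is a hypothetical helper, and the column names and values are made up; in the loop above the rows would come from dict_slice.numpy() and target.numpy().

```python
def format_table(columns, rows):
    """Return a text table whose columns are padded to a common width."""
    cells = [[str(c) for c in columns]] + [[str(v) for v in row] for row in rows]
    # Width of each column is the longest cell found in it, header included.
    widths = [max(len(line[i]) for line in cells) for i in range(len(columns))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(line, widths)).rstrip()
        for line in cells
    )

# Made-up sample values.
columns = ["age", "sex", "chol", "target"]
rows = [[63, 1, 233, 0], [67, 1, 286, 1]]

table = format_table(columns, rows)
print(table)
```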

Pandas add column to new data frame at associated string value?

I am trying to add a column from one dataframe to another,
df.head()
street_map2[["PRE_DIR","ST_NAME","ST_TYPE","STREET_ID"]].head()
The PRE_DIR is just the prefix of the street name. What I want to do is add the column STREET_ID at the associated street to df. I have tried a few approaches but my inexperience with pandas and the comparison of strings is getting in the way,
street_map2['STREET'] = df["STREET"]
street_map2['STREET'] = np.where(street_map2['STREET'] == street_map2["ST_NAME"])
The above code raises a "ValueError: Length of values does not match length of index". I've also tried using street_map2['STREET'].str in street_map2["ST_NAME"].str. Can anyone think of a good way to do this? (Note that it doesn't need to be 100% accurate, it just needs to match most rows, and it can be completely different from the approach tried above.)
EDIT: Thank you to all who have tried so far. I have not resolved the issues yet. Here is some more data,
street_map2["ST_NAME"]
I have tried this approach as suggested but still have some indexing problems,
def get_street_id(street_name):
    return street_map2[street_map2['ST_NAME'].isin(df["STREET"])].iloc[0].ST_NAME
df["STREET_ID"] = df["STREET"].map(get_street_id)
df["STREET_ID"]
This throws this error,
If it helps the data frames are not the same length. Any more ideas or a way to fix the above would be greatly appreciated.
For you to do this, you need to merge these dataframes. One way to do it is:
df.merge(street_map2, left_on='STREET', right_on='ST_NAME')
What this will do is: it will look for equal values in ST_NAME and STREET columns and fill the rows with values from the other columns from both dataframes.
Check this link for more information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
Also, the strings on the columns you try to merge on have to match perfectly (case included).
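Because the merge keys have to match exactly, one option is to normalize both columns into a helper key first and merge on that. A minimal sketch with made-up street names and IDs: street_key is a hypothetical helper column, how="left" keeps every row of df, and indicator=True adds a _merge column showing which rows found a partner.

```python
import pandas as pd

# Made-up samples shaped like the frames in the question.
df = pd.DataFrame({"STREET": ["MAIN", "oak ", "Elm", "UNKNOWN RD"]})
street_map2 = pd.DataFrame({"ST_NAME": ["Main", "Oak", "Elm"], "STREET_ID": [101, 102, 103]})

# Normalize case and surrounding whitespace on both sides.
df["street_key"] = df["STREET"].str.strip().str.upper()
lookup = (
    street_map2.assign(street_key=street_map2["ST_NAME"].str.strip().str.upper())
    [["street_key", "STREET_ID"]]
    .drop_duplicates("street_key")   # one ID per key, so df rows are not multiplied
)

merged = df.merge(lookup, on="street_key", how="left", indicator=True)
print(merged)
```

Rows marked left_only in _merge are the street names that still need a better mapping.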
You can do something like this, with a map function:
df["STREET_ID"] = df["STREET"].map(get_street_id)
Where get_street_id is defined as a function that, given a value from df["STREET"], will return a value to insert into the new column:
(disclaimer; currently untested)
def get_street_id(street_name):
    # STREET_ID of the first matching row; iloc[0] raises IndexError if no row matches
    return street_map2[street_map2["ST_NAME"] == street_name].iloc[0].STREET_ID
We get a dataframe of street_map2 filtered by where the ST_NAME column is the same as the street name:
street_map2[street_map2["ST_NAME"] == street_name]
Then we take the first row of that with iloc[0], and return its STREET_ID value.
We can then add that error-tolerance that you've addressed in your question by updating the indexing operation:
...
street_map2[street_map2["ST_NAME"].str.contains(street_name)]
...
or perhaps,
...
street_map2[street_map2["ST_NAME"].str.startswith(street_name)]
...
Or, more flexibly:
...
street_map2[
    street_map2["ST_NAME"].str.lower().str.replace("street", "st", regex=False).str.startswith(street_name.lower().replace("street", "st"))
]
...
...which will lowercase both values, convert, for example, "street" to "st" (so the mapping is more likely to overlap) and then check whether the ST_NAME value starts with the given street name.
If this is still not working for you, you may unfortunately need to come up with a more accurate mapping dataset between your street names! It is very possible that the street names are just too different to easily match with string comparisons.
(If you're able to provide some examples of street names and where they should overlap, we may be able to help you better develop a "fuzzy" match!)
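As a starting point for such a fuzzy match, here is a rough sketch using difflib from the standard library. closest_street is a hypothetical helper, the candidate names are made up, and the 0.6 cutoff is just difflib's default, not a tuned value.

```python
import difflib

def closest_street(street_name, candidates, cutoff=0.6):
    """Return the candidate most similar to street_name, or None if nothing is close."""
    def normalize(name):
        return name.lower().replace("street", "st").strip()

    # Map normalized spellings back to the original candidate strings.
    normalized = {normalize(c): c for c in candidates}
    hits = difflib.get_close_matches(normalize(street_name), list(normalized), n=1, cutoff=cutoff)
    return normalized[hits[0]] if hits else None

# Made-up candidates standing in for street_map2["ST_NAME"].
candidates = ["MAIN ST", "OAK AVE", "ELMWOOD DR"]

print(closest_street("Main Street", candidates))
print(closest_street("Zzyzx", candidates))
```

Calling this once per row compares against every candidate, so on large frames it may be worth running it only on the names that failed an exact match.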
Alright, I managed to figure it out, but the solution probably won't be too helpful if you aren't in the exact same situation with the same data. Bernardo Alencar's answer was essentially correct, except I was unable to apply an operation on the strings while doing the merge (I am still not sure if there is a way to do it). I found another dataset that had the street names formatted similarly to the first. I then merged the first with this third data frame. After this, the first and second both had a "STREET_ID" column. Then I finally managed to merge the second one with the combined one by using,
temp = combined["STREET_ID"]
CrimesToMapDF = street_maps.merge(temp, left_on='STREET_ID', right_on='STREET_ID')
This gave me the desired final data frame with the associated street IDs.

incorrect mean from PANDAS dataframe

So here's an interesting thing:
Using python 2.7:
I've got a dataframe of about 5,100 entries, each with a number (melting point) in a column titled 'Tm'. Using the code:
self.sort_df[['Tm']].mean(axis=0)
I get a mean of:
Tm 92.969204
dtype: float64
This doesn't make sense because no entry has a Tm of greater than 83.
Does .mean() not work for this many values? I've tried paring down the dataset and it seems to work for ~1,000 entries, but considering I have a full dataset of 150,000 to run at once, I'd like to know if I need to find a different way to calculate the mean.
A more readable syntax would be:
sort_df['Tm'].mean()
Try sort_df['Tm'].value_counts() or sort_df['Tm'].max() to see what values are present. Some unexpected values must have crept in.
The .mean function gives an accurate result irrespective of the size of the data.
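A quick way to run that check, sketched on made-up numbers: here a single placeholder value of 9999.0 (an assumption, standing in for whatever stray entry is in the real data) is enough to push the mean above every genuine measurement.

```python
import pandas as pd

# Made-up melting points; 9999.0 is a hypothetical stray placeholder value.
sort_df = pd.DataFrame({"Tm": [55.0, 61.5, 70.2, 83.0, 9999.0]})

print(sort_df["Tm"].mean())        # inflated by the stray value
print(sort_df["Tm"].describe())    # min, max and quartiles expose it quickly

# Rows above the largest value you believe is valid.
suspicious = sort_df[sort_df["Tm"] > 83]
print(suspicious)

# Mean over the plausible range only.
cleaned_mean = sort_df.loc[sort_df["Tm"] <= 83, "Tm"].mean()
print(cleaned_mean)
```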
