Pandas - locating using index vs column name - python

I don't understand why the two snippets below produce different outputs when both refer to a column in a CSV file. The second also includes rows that have NaN values, whereas the first removes them. I don't know why that happens, so can someone please explain? Thanks!
import pandas as pd
df = pd.read_csv('climate_data_2017.csv')
is_over_35 = df["Maximum temperature (C)"] > 35
vs
import pandas as pd
df = pd.read_csv('climate_data_2017.csv')
is_over_35 = df[[3]] > 35
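An aside on why the two selections behave differently: a boolean Series used as an indexer filters rows (and NaN comparisons evaluate to False, so NaN rows are dropped), while a boolean DataFrame masks cell-by-cell and keeps every row, filling non-matches with NaN. Note also that df[[3]] looks up a column *named* 3; positional access needs iloc. A minimal sketch with a single made-up temperature column:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Maximum temperature (C)": [30.0, 36.0, np.nan, 40.0]})

# Label-based selection returns a Series; comparing gives a boolean Series,
# and NaN > 35 evaluates to False, so NaN rows disappear when filtering.
mask_series = df["Maximum temperature (C)"] > 35
print(df[mask_series].shape[0])  # 2 rows

# Positional selection uses iloc. A one-column DataFrame as the mask keeps
# the frame's shape and fills non-matching cells with NaN instead of
# removing the rows, which is why the NaN rows stay in the output.
mask_frame = df.iloc[:, [0]] > 35
print(df[mask_frame].shape[0])  # still 4 rows, NaN where the mask is False
```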

Related

Python Pandas number regex for part number

import pandas as pd
df = pd.read_csv('test.csv', dtype='unicode')
df.dropna(subset=["Description.1"], inplace = True)
df_filtered = df[(df['Part'].str.contains("-")==True) & (df['Part'].str.len()==8)]
I am trying to get python pandas to only filter in the Part column to show numbers in this format: "###-####"
I cannot seem to figure out how to only show those. Any help would be greatly appreciated.
Right now, I have it filtering part numbers that contain a '-' and are 8 characters long. Even with this, I am still getting some that don't match our internal format.
Can't seem to find anything similar to this online, and I am fairly new to Python.
Thanks
A small example
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO("""name,dig
aaa,750-2220
bbb,12-214
ccc,120
ddd,1020-10"""))
df.loc[df.dig.str.contains(r"\d{3}-\d{4}")]
which outputs
  name       dig
0  aaa  750-2220
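One caveat with the answer above: str.contains matches anywhere in the string, so a longer part number that merely contains a valid ###-#### substring would still pass. If the whole value must match the format, str.fullmatch (available since pandas 1.1) is stricter. A small sketch with made-up part numbers:

```python
import pandas as pd

parts = pd.Series(["750-2220", "1750-22205", "12-214", "120"])

# contains() matches anywhere in the string, so "1750-22205" slips through
loose = parts.str.contains(r"\d{3}-\d{4}")
# fullmatch() requires the entire string to be ###-#### (pandas >= 1.1)
strict = parts.str.fullmatch(r"\d{3}-\d{4}")

print(parts[loose].tolist())   # ['750-2220', '1750-22205']
print(parts[strict].tolist())  # ['750-2220']
```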

Compare every 2 rows using Pandas and show the difference

I have a problem with our code, which compares every 2 rows across different Excel sheets:
and we have code to compare every row:
import pandas as pd
import numpy as np
old_df = pd.read_excel('Test.xlsx', sheet_name="Best Practice Config", names=["A"], header=None)
new_df = pd.read_excel('Test.xlsx', sheet_name="Existing Config", names=["B"], header=None)
compare = old_df[~old_df["A"].isin(new_df["B"])]
but I need to compare 2 rows at a time. Please advise on the best way to do that in pandas.
Try the pandas.DataFrame.compare method. The documentation is available here.
old_df.compare(new_df)
I hope it will be useful for you.
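For reference, a minimal sketch of what compare returns. The sheet contents here are made up, and note that compare (pandas >= 1.1) requires both frames to have the same shape and the same row/column labels, so both columns are named "A" here:

```python
import pandas as pd

# hypothetical stand-ins for the two single-column Excel sheets
old_df = pd.DataFrame({"A": ["hostname r1", "ip route 0.0.0.0", "ntp server 1.1.1.1"]})
new_df = pd.DataFrame({"A": ["hostname r1", "ip route 10.0.0.0", "ntp server 1.1.1.1"]})

# compare() returns only the rows and columns where the frames differ,
# with 'self' (old) and 'other' (new) sub-columns
diff = old_df.compare(new_df)
print(diff)
```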

How can I get the difference between values in a Pandas dataframe grouped by another field?

I have a CSV of data I've loaded into a dataframe that I'm trying to massage: I want to create a new column that contains the difference from one record to another, grouped by another field.
Here's my code:
import pandas as pd
import matplotlib.pyplot as plt
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv'
all_counties = pd.read_csv(url, dtype={"fips": str})
all_counties.date = pd.to_datetime(all_counties.date)
oregon = all_counties.loc[all_counties['state'] == 'Oregon']
oregon.set_index('date', inplace=True)
oregon.sort_values('county', inplace=True)
# This is not working; I was hoping to find the differences from one day to another on a per-county basis
oregon['delta'] = oregon.groupby(['state','county'])['cases'].shift(1, fill_value=0)
oregon.tail()
Unfortunately, I'm getting results where the delta is always the same as the cases.
I'm new at Pandas and relatively inexperienced with Python, so bonus points if you can point me towards how to best read the documentation.
Let's try:
oregon['delta']=oregon.groupby(['state','county'])['cases'].diff().fillna(0)
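To see what diff does per group, here is a tiny made-up version of the county case counts (county names and numbers are hypothetical stand-ins, grouping on just one key for brevity). diff gives NaN for the first row of each group, which fillna(0) replaces:

```python
import pandas as pd

df = pd.DataFrame({
    "county": ["Baker", "Baker", "Baker", "Clackamas", "Clackamas"],
    "cases":  [1, 3, 6, 10, 14],
})

# per-county day-over-day change; first row of each group becomes 0
df["delta"] = df.groupby("county")["cases"].diff().fillna(0)
print(df["delta"].tolist())  # [0.0, 2.0, 3.0, 0.0, 4.0]
```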

Pandas DataFrame, columns within a column

Here below is the CSV file that I'm working with:
I'm trying to get my hands on the enj coin: (United States) column. Nonetheless, when I print all of the DataFrame's columns, it doesn't appear to be treated as a column.
Code:
import pandas as pd
df = pd.read_csv("/multiTimeline.csv")
print(df.columns)
I get the following output:
Index(['Category: All categories'], dtype='object')
I've tried accessing the column with df['Category: All categories']['enj coin: (United States)'] but sadly it doesn't work.
Question:
Could someone possibly explain to me how I could possibly transform this DataFrame (which has only one column Category: All categories) into a DataFrame which has two columns Time and enj coin: (United States)?
Thank you very much for your help
Try using the parameter skiprows=2 when reading in the CSV. I.e.
df = pd.read_csv("/multiTimeline.csv", skiprows=2)
The csv looks good.
Ignore the complex header at the top.
pd.read_csv(csvdata, header=1)
The entire header can be taken in as well, although it is not delimited as the data is.
import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed in newer pandas; io.StringIO works everywhere
print(pd.__version__)
csvdata = StringIO("""Category: All categories
Time,enj coin: (United States)
2019-04-10T19,7
2019-04-10T20,20""")
df = pd.read_csv(csvdata, header=[0,1])
print(df)
0.24.2
  Category: All categories
                      Time enj coin: (United States)
0            2019-04-10T19                         7
1            2019-04-10T20                        20
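On newer pandas versions the same idea works with io.StringIO, and passing header=1 treats the second line as the real header, so the two columns the question asks for (Time and enj coin: (United States)) come out directly. A sketch with the sample data inlined:

```python
import pandas as pd
from io import StringIO

csvdata = StringIO("""Category: All categories
Time,enj coin: (United States)
2019-04-10T19,7
2019-04-10T20,20""")

# header=1 skips the category banner and uses the second line as the header
df = pd.read_csv(csvdata, header=1)
print(df.columns.tolist())  # ['Time', 'enj coin: (United States)']
```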

How to implement a condition in a Pandas DataFrame to save it

I would like to create a condition for a Pandas DataFrame so that I only save rows up to a certain value in a specific column,
e.g. save only until df['cycle'] == 2.
From the answers below I gathered that df[df['cycle'] <= 2] will solve my problem.
Edit: If I understand correctly, pandas always reads the whole file; with nrows you can tell it to stop at a given row index, but what if I want to stop at a specific value in a column rather than at an index? How can I do that?
See my code below:
import pandas as pd
import numpy as np
l = list(np.linspace(0,10,12))
data = [
    ('time', l),
    ('A', [0, 5, 0.6, -4.8, -0.3, 4.9, 0.2, -4.7, 0.5, 5, 0.1, -4.6]),
    ('B', [0, 300, 20, -280, -25, 290, 30, -270, 40, 300, -10, -260]),
]
df = pd.DataFrame.from_dict(dict(data))
df['cycle'] = [df.index.get_loc(i) // 4 + 1 for i in df.index]
df = df[df['cycle'] <= 2]  # assign the filtered result, otherwise to_csv still writes every row
df.to_csv(path_or_buf='test.out', index=True, sep='\t', columns=['time','A','B','cycle'], decimal='.')
So I modified the code according to the suggestions from other users.
I am grateful for any help I can get.
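Putting it together, a minimal sketch of the mask-then-save pattern, with a made-up cycle column and a hypothetical output file name:

```python
import pandas as pd

# toy frame with a 'cycle' column, standing in for the real data
df = pd.DataFrame({"cycle": [1, 1, 2, 2, 3, 3], "A": range(6)})

# boolean mask keeps only rows up to cycle 2, then save just those
kept = df[df["cycle"] <= 2]
kept.to_csv("subset.out", sep="\t", index=True)  # hypothetical output path
print(kept["cycle"].tolist())  # [1, 1, 2, 2]
```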
