Python Pandas number regex for part number - python

import pandas as pd
df = pd.read_csv('test.csv', dtype='unicode')
df.dropna(subset=["Description.1"], inplace=True)
df_filtered = df[(df['Part'].str.contains("-") == True) & (df['Part'].str.len() == 8)]
I am trying to get pandas to filter the Part column so it only shows numbers in this format: "###-####"
I cannot seem to figure out how to show only those. Any help would be greatly appreciated.
Right now I filter for part numbers that contain a '-' and are 8 characters long. Even with this, I am still getting some that don't match our internal format.
I can't seem to find anything similar to this online, and I am fairly new to Python.
Thanks

A small example
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO("""name,dig
aaa,750-2220
bbb,12-214
ccc,120
ddd,1020-10"""))
df.loc[df.dig.str.contains(r"^\d{3}-\d{4}$")]
which outputs
name dig
0 aaa 750-2220
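Note that str.contains does a substring search, so the pattern is anchored with ^ and $ above; otherwise longer strings such as "9750-22201" would also match. Applied to the frame from the question, a sketch (assuming the column is named 'Part' as shown there):
# na=False treats missing part numbers as non-matches instead of propagating NaN
df_filtered = df[df['Part'].str.contains(r"^\d{3}-\d{4}$", na=False)]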

Related

Pandas read_csv truncating 0s in zip code [duplicate]

I am importing study data into a Pandas DataFrame using read_csv.
My subject codes are 6 numbers coding, among other things, the day of birth. For some of my subjects this results in a code with a leading zero (e.g. "010816").
When I import into Pandas, the leading zero is stripped off and the column is formatted as int64.
Is there a way to import this column unchanged, maybe as a string?
I tried using a custom converter for the column, but it does not work - it seems as if the custom conversion takes place before Pandas converts to int.
As indicated in this answer by Lev Landau, there could be a simple solution: use the converters option for a certain column in the read_csv function:
converters={'column_name': str}
Let's say I have a CSV file projects.csv like below:
project_name,project_id
Some Project,000245
Another Project,000478
For example, the code below trims the leading zeros:
from pandas import read_csv
dataframe = read_csv('projects.csv')
print(dataframe)
Result:
project_name project_id
0 Some Project 245
1 Another Project 478
Solution code example:
from pandas import read_csv
dataframe = read_csv('projects.csv', converters={'project_id': str})
print(dataframe)
Required result:
project_name project_id
0 Some Project 000245
1 Another Project 000478
To have all columns as str:
pd.read_csv('sample.csv', dtype=str)
To have certain columns as str:
# column names which need to be string
lst_str_cols = ['prefix', 'serial']
dict_dtypes = {x: 'str' for x in lst_str_cols}
pd.read_csv('sample.csv', dtype=dict_dtypes)
Here is a shorter, robust, and fully working solution: simply define a mapping (dictionary) between variable names and the desired data types:
dtype_dic = {'subject_id': str,
             'subject_number': 'float'}
Use that mapping with pd.read_csv():
df = pd.read_csv(yourdata, dtype=dtype_dic)
et voila!
If you have a lot of columns and you don't know which ones contain leading zeros that might be missed, or you might just need to automate your code, you can do the following:
df = pd.read_csv("your_file.csv", nrows=1) # Just take the first row to extract the columns' names
col_str_dic = {column:str for column in list(df)}
df = pd.read_csv("your_file.csv", dtype=col_str_dic) # Now you can read the compete file
You could also do:
df = pd.read_csv("your_file.csv", dtype=str)
By doing this you will have all your columns as strings and you won't lose any leading zeros.
You can do this; it works on all versions of Pandas:
pd.read_csv('filename.csv', dtype={'zero_column_name': object})
You can use converters to pad a number to a fixed width if you know the width.
For example, if the width is 5, then
# converters receive the raw field as a string, hence int(x) before zero-padding
data = pd.read_csv('text.csv', converters={'column1': lambda x: f"{int(x):05d}"})
This will do the trick. It works for pandas==0.23.0 and also for read_excel.
Python 3.6 or higher is required.
I don't think you can specify a column type the way you want (unless there have been recent changes, and unless the 6-digit number is a date that you can convert to datetime). You could try using np.genfromtxt() and create the DataFrame from there.
EDIT: Take a look at Wes McKinney's blog; there might be something there for you. It seems that a new parser is coming in pandas 0.10 in November.
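A sketch of the genfromtxt route, assuming the projects.csv file shown above (an illustration, not part of the original answer):
import numpy as np
import pandas as pd
# Read every field as a string so leading zeros survive, then build the frame.
raw = np.genfromtxt('projects.csv', delimiter=',', dtype=str, skip_header=1)
dataframe = pd.DataFrame(raw, columns=['project_name', 'project_id'])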
As an example, consider the following my_data.txt file:
id,A
03,5
04,6
To preserve the leading zeros for the id column:
df = pd.read_csv("my_data.txt", dtype={"id":"string"})
df
id A
0 03 5
1 04 6

Pandas - locating using index vs column name

I don't understand why the two code snippets below produce different outputs when both refer to the same column of a CSV file. The second also includes rows that have NaN values, whereas the first removes them. I don't know why it does that, so can someone please explain? Thanks!
import pandas as pd
df = pd.read_csv('climate_data_2017.csv')
is_over_35 = df["Maximum temperature (C)"] > 35
vs
import pandas as pd
df = pd.read_csv('climate_data_2017.csv')
is_over_35 = df[[3]] > 35
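A likely cause, sketched below (an aside, assuming df[[3]] resolves to the temperature column, as positional fallback did in older pandas): selecting with a list returns a DataFrame rather than a Series, and indexing a frame with a boolean DataFrame masks cells to NaN instead of dropping rows.
import pandas as pd
# A tiny hypothetical frame standing in for climate_data_2017.csv.
df = pd.DataFrame({"Maximum temperature (C)": [30.0, 36.0, 40.0]})
mask_series = df["Maximum temperature (C)"] > 35   # boolean Series
mask_frame = df[["Maximum temperature (C)"]] > 35  # one-column boolean DataFrame
print(df[mask_series])  # a Series mask drops the rows where it is False
print(df[mask_frame])   # a DataFrame mask keeps the shape, with NaN where False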

Pandas series changed through CSV export import

I need to save a pandas Series and make sure that, once loaded again, it is exactly the same. However, it is not. I have tried to manipulate the result in various ways but cannot find a solution. This is my MWE:
import pandas as pd
idx = pd.date_range(start='2010', periods=100, freq='1M')
ts = pd.Series(data=range(100), index=idx)
ts.to_csv('test.csv')
imported_ts = pd.read_csv('test.csv', delimiter=',', index_col=None)
print(ts.equals(imported_ts))
>>> False
What am I doing wrong?
You cannot. A pandas Series contains an index and a data column, both having a type (the dtype), a (possibly complex) title which itself has a type, and values.
A CSV file is just a text file which contains text representations of values, and optionally the text representation of the titles in the first row. Nothing more. When things are simple, meaning the titles are simple strings and all values are integers or small decimals (*), the save-load round trip will give you exactly what you initially had.
But if you have more complex use cases, for example date types, or object-dtype columns containing decimal.Decimal values, the generated CSV file will only contain a textual representation with no type information. So it is impossible to be sure of the original dtype just by reading the content of a CSV file, which is why the read_csv method has so many options.
(*) by small decimal I mean a small number of digits after the decimal point.
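To make that concrete, here is a sketch of a round trip that does recover the series from the question, by telling read_csv how to rebuild what the CSV dropped (an illustration, not part of the original answer):
import pandas as pd
idx = pd.date_range(start='2010', periods=100, freq='1M')
ts = pd.Series(data=range(100), index=idx)
ts.to_csv('test.csv')
# Rebuild what the CSV lost: take the first column as the index
# and parse it back into datetimes, then pull out the single column.
imported = pd.read_csv('test.csv', index_col=0, parse_dates=True).iloc[:, 0]
imported.name = ts.name  # the CSV stores the header '0'; the original series is unnamed
print(ts.equals(imported))
>>> True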
I resolved this issue by using pickle instead.
import pandas as pd
idx = pd.date_range(start='2010', periods=100, freq='1M')
ts = pd.Series(data=range(100), index=idx)
ts.to_pickle("./test.pkl")
unpickled_df = pd.read_pickle("./test.pkl")
print(ts.equals(unpickled_df))
>>> True
What is happening is that read_csv by default produces a DataFrame, even for a single column. In addition, due to the lack of typing in CSV, it could be more complicated than my suggestion; in that case see Serge Ballesta's answer.
If it is a simple case, try converting the result:
print(ts.equals(imported_ts.iloc[:,0]))
You are saving the dates as the index and comparing them with the values of your df. Do this instead:
import pandas as pd
idx = pd.date_range(start='2010', periods=100, freq='1M')
ts = pd.Series(data=range(100), index=idx)
ts.to_csv('test.csv')
imported_ts = pd.read_csv('test.csv', delimiter=',', index_col=['Unnamed: 0'])
print(ts.index.equals(imported_ts.index))
Gives
True

How can I get the difference between values in a Pandas dataframe grouped by another field?

I have a CSV of data I've loaded into a dataframe that I'm trying to massage: I want to create a new column that contains the difference from one record to the next, grouped by another field.
Here's my code:
import pandas as pd
import matplotlib.pyplot as plt
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv'
all_counties = pd.read_csv(url, dtype={"fips": str})
all_counties.date = pd.to_datetime(all_counties.date)
oregon = all_counties.loc[all_counties['state'] == 'Oregon']
oregon.set_index('date', inplace=True)
oregon.sort_values('county', inplace=True)
# This is not working; I was hoping to find the differences from one day to another on a per-county basis
oregon['delta'] = oregon.groupby(['state','county'])['cases'].shift(1, fill_value=0)
oregon.tail()
Unfortunately, I'm getting results where the delta is always the same as the cases.
I'm new at Pandas and relatively inexperienced with Python, so bonus points if you can point me towards how to best read the documentation.
Let's try:
oregon['delta'] = oregon.groupby(['state', 'county'])['cases'].diff().fillna(0)
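For comparison, the original attempt only copied the previous row's cases into delta: shift returns the prior value, it does not subtract it. Built by hand, the difference would look something like this sketch:
# shift(1) returns the previous row's value; subtracting it yields the change.
# For each group's first row this gives the full case count (cases - 0),
# whereas .diff().fillna(0) above gives 0.
oregon['delta'] = oregon['cases'] - oregon.groupby(['state', 'county'])['cases'].shift(1, fill_value=0)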

How to implement a condition in a Pandas DataFrame, to save it

I would like to save a Pandas DataFrame only up to a certain value in a specific column,
e.g. save only the rows up to df['cycle'] == 2.
From what I gathered from the answers below, df[df['cycle']<=2] will solve my problem.
Edit: If I am correct, pandas always reads the whole file; with nrows you can say, e.g., go up to index x. But what if I don't want to use an index but a specific value from a column? How can I do that?
See my code below:
import pandas as pd
import numpy as np
l = list(np.linspace(0, 10, 12))
data = [
    ('time', l),
    ('A', [0, 5, 0.6, -4.8, -0.3, 4.9, 0.2, -4.7, 0.5, 5, 0.1, -4.6]),
    ('B', [0, 300, 20, -280, -25, 290, 30, -270, 40, 300, -10, -260]),
]
df = pd.DataFrame.from_dict(dict(data))
df['cycle'] = [df.index.get_loc(i) // 4 + 1 for i in df.index]
df = df[df['cycle'] <= 2]  # keep only the rows up to cycle 2 before saving
df.to_csv(path_or_buf='test.out', index=True, sep='\t', columns=['time', 'A', 'B', 'cycle'], decimal='.')
I modified the code above according to the suggestions from other users.
I am glad for any help that I can get.
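As for the question in the edit (stopping at a column value rather than at a row count): read_csv cannot stop on a value directly, but its chunksize parameter lets you read the file incrementally and stop once the value has been passed. A sketch, assuming a tab-separated file like test.out above that contains all cycles and a non-decreasing 'cycle' column:
import pandas as pd
parts = []
for chunk in pd.read_csv('test.out', sep='\t', chunksize=4):
    parts.append(chunk[chunk['cycle'] <= 2])
    if (chunk['cycle'] > 2).any():  # past the target value, stop reading
        break
df_partial = pd.concat(parts, ignore_index=True)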
