Auxiliary data / description in pandas DataFrame - python

Imagine I have a pd.DataFrame with many rows and I want to attach a string describing or summarizing it.
One way would be to add a column holding this value, but that wastes memory (e.g. when the string is a long specification of an ML model).
I could also encode it in the file name, but that has limitations and is impractical at save/load time.
I could store the data in a separate file, but that is not what I want.
I could subclass pd.DataFrame and add this field, but then I'm unsure whether pickle save/load would still work properly.
So is there some clean way to store something like a "description" in a pandas DataFrame? Preferably one that withstands to_pickle/read_pickle operations.
[EDIT]: Of course this raises further questions, such as what happens when we concatenate DataFrames carrying such information, but let's stick to the saving/loading problem only.

After some googling: you can name a DataFrame as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.ones([10, 10]))
df.name = 'My name'
print(df.name)
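Note that an ad-hoc attribute like df.name is easy to lose: pandas does not preserve arbitrary attributes through most operations, and it is generally dropped by pickling. In recent pandas versions (1.2+), DataFrame.attrs is a dict intended for exactly this kind of metadata, and it survives a pickle round trip. A minimal sketch:

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.ones([10, 10]))
# attrs is a plain dict for frame-level metadata.
df.attrs['description'] = 'long ML model specification'

# Round-trip through pickle: attrs is preserved.
path = os.path.join(tempfile.mkdtemp(), 'df.pkl')
df.to_pickle(path)
restored = pd.read_pickle(path)
print(restored.attrs['description'])
```

attrs is still documented as experimental, so it is worth a quick round-trip test against the pandas version you actually deploy with.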

Related

Looking for names that somewhat match

I am creating an app that loads a DataFrame from a spreadsheet. Sometimes the names are not the same as before (someone changed a column name slightly) and are just barely different. What would be the best way to "look" for these closely related column names when loading the df? If I ask python to look for "col_1" but in the user's spreadsheet the column is "col1", then python won't find it.
Example:
import pandas as pd
df = pd.read_excel('data.xlsx')
Here are the column names I am looking for. The rest of the columns load just fine, but the column names that are just barely different get skipped and their data never gets loaded. How can I make sure that if a name is close to the name python is looking for, it will get loaded into the df?
Names I'm looking for:
'Water Consumption'
'Female Weight'
'Uniformity'
data.xlsx, incoming data column names that are slightly different:
'Water Consumed Actual'
'Body Weight Actual'
'Unif %'
The built-in function difflib.get_close_matches can help with column names that are slightly off. Using it with the usecols argument of pd.read_excel should get you most of the way there.
You can do something like:
import difflib
import pandas as pd
desired_columns = ['Water Consumption', 'Female Weight', 'Uniformity']
def column_checker(col_name):
    if difflib.get_close_matches(col_name, desired_columns):
        return True
    else:
        return False
df = pd.read_excel('data.xlsx', usecols=column_checker)
You can mess with the parameters of get_close_matches to make it more or less sensitive too.
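For example, with the default cutoff of 0.6, a short header like 'Unif %' won't match 'Uniformity', but lowering the cutoff picks it up. A toy sketch using the names from the question:

```python
import difflib

desired_columns = ['Water Consumption', 'Female Weight', 'Uniformity']

# The default cutoff (0.6) is too strict for this pair...
strict = difflib.get_close_matches('Unif %', desired_columns, n=1)
# ...but a looser cutoff finds the intended column.
loose = difflib.get_close_matches('Unif %', desired_columns, n=1, cutoff=0.4)
print(strict, loose)
```

Lowering cutoff too far risks false positives between genuinely different columns, so tune it against a sample of real headers.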

Importing 'prefixed-line' csv into pandas

I have a csv formatted like below which I'd like to import into pandas. Basically, it is a bunch of measurements and data from multiple samples.
dataformat1,sample1,sample_source,start_time
dataformat2,parameter1,spec_a_1,spec_b_1,measurement1
dataformat2,parameter2,spec_a_2,spec_b_2,measurement2
dataformat2,parameter3,spec_a_3,spec_b_3,measurement3
dataformat3,result_code_for_sample1,completion_time_for_sample1
...
dataformat1,sampleN,sample_source,start_time
dataformat2,parameter1,spec_a_1,spec_b_1,measurement1
dataformat2,parameter2,spec_a_2,spec_b_2,measurement2
dataformat2,parameter3,spec_a_3,spec_b_3,measurement3
dataformat3,result_code_for_sampleN,completion_time_for_sampleN
Essentially, the first field of each line describes the data format of the rest of that line. I would like to import all of this into a single pandas DataFrame, slotting each line into the relevant section. I am currently planning to do this by:
prepending the line number into the csv
reading each dataformat# into a separate dataframe
combining, sorting, and ffill/bfill/unpivot(?) shenanigans to fill in all the data
I assume this will work, but I am wondering if there's a cleaner way to do this either within pandas or using some other library. It is a somewhat common data logging paradigm in the work I do.
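One cleaner alternative is a single pass over the raw lines, dispatching on the first field and only building the DataFrame at the end. A minimal sketch, with hypothetical field values standing in for the real ones:

```python
import csv
import io

import pandas as pd

# Hypothetical sample mirroring the layout in the question.
raw = """dataformat1,sample1,src_a,08:00
dataformat2,parameter1,1,2,0.5
dataformat2,parameter2,3,4,0.7
dataformat3,OK,09:00
dataformat1,sample2,src_b,10:00
dataformat2,parameter1,5,6,0.9
dataformat3,FAIL,11:00
"""

rows = []
sample = None
for rec in csv.reader(io.StringIO(raw)):
    kind = rec[0]
    if kind == 'dataformat1':    # header line: starts a new sample
        sample = {'sample': rec[1], 'source': rec[2], 'start_time': rec[3]}
    elif kind == 'dataformat2':  # measurement line: one row per parameter
        rows.append({**sample, 'parameter': rec[1], 'spec_a': rec[2],
                     'spec_b': rec[3], 'measurement': rec[4]})
    elif kind == 'dataformat3':  # footer: back-fill result onto this sample's rows
        for r in rows:
            if r['sample'] == sample['sample']:
                r.update(result_code=rec[1], completion_time=rec[2])

df = pd.DataFrame(rows)
```

This avoids the line-number prepending and the later sort/ffill/bfill shuffle, at the cost of giving up pandas' C-speed CSV parser; for very large files the multi-pass read_csv approach may still win.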

Converting weird NBA data structure into a pandas dataframe from an API

I am trying to convert this string into a pandas dataframe where each value before the colon is a header and the next value is put in the column. This is what the data structure looks like:
{"data":[{"id":47179,"date":"2019-01-30T00:00:00.000Z","home_team":{"id":2,"abbreviation":"BOS","city":"Boston","conference":"East","division":"Atlantic","full_name":"Boston Celtics","name":"Celtics"},"home_team_score":126,"period":4,"postseason":false,"season":2018,"status":"Final","time":" ","visitor_team":{"id":4,"abbreviation":"CHA","city":"Charlotte","conference":"East","division":"Southeast","full_name":"Charlotte Hornets","name":"Hornets"},"visitor_team_score":94},{"id":48751,"date":"2019-02-09T00:00:00.000Z","home_team":{"id":2,"abbreviation":"BOS","city":"Boston","conference":"East","division":"Atlantic","full_name":"Boston Celtics","name":"Celtics"}}
The string I am converting has hundreds of games, but I truncated it after two. What is this data structure called, and how can I quickly move it into a DataFrame?
It looks like JSON, but it's not structured in a very useful way; everything is under the "data" key. To get it into a DataFrame, use a combination of Python's json module and pandas.json_normalize().
import json
import pandas as pd
data_string = """{"data":[{"id":47179,"date":"2019-01-30T00:00:00.000Z",
"home_team":{"id":2,"abbreviation":"BOS","city":"Boston",
"conference":"East","division":"Atlantic","full_name":"Boston Celtics",
"name":"Celtics"},"home_team_score":126,"period":4,"postseason":false,
"season":2018,"status":"Final","time":" ","visitor_team":{"id":4,
"abbreviation":"CHA","city":"Charlotte","conference":"East",
"division":"Southeast","full_name":"Charlotte Hornets","name":"Hornets"},
"visitor_team_score":94},{"id":48751,"date":"2019-02-09T00:00:00.000Z",
"home_team":{"id":2,"abbreviation":"BOS","city":"Boston",
"conference":"East","division":"Atlantic","full_name":"Boston Celtics",
"name":"Celtics"}}]}"""
raw_data = json.loads(data_string)
df = pd.json_normalize(raw_data['data'])
There is a lot of redundant information about the teams. Look at "home_team" for the two games: the information is the same. You could pull all the team data into a separate DataFrame and keep only the team "id" or "abbreviation" in the DataFrame with all the games.
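A sketch of that split, using a trimmed-down, hypothetical version of the payload:

```python
import pandas as pd

raw_data = {"data": [
    {"id": 47179,
     "home_team": {"id": 2, "abbreviation": "BOS", "full_name": "Boston Celtics"},
     "home_team_score": 126,
     "visitor_team": {"id": 4, "abbreviation": "CHA", "full_name": "Charlotte Hornets"},
     "visitor_team_score": 94},
]}

games = pd.json_normalize(raw_data['data'])

# Build a de-duplicated team lookup table from both nested objects.
teams = pd.concat([
    pd.json_normalize([g[side] for g in raw_data['data'] if side in g])
    for side in ('home_team', 'visitor_team')
]).drop_duplicates('id').reset_index(drop=True)

# In the games frame, keep only the team ids as foreign keys.
games = games.drop(columns=[c for c in games.columns
                            if ('home_team.' in c or 'visitor_team.' in c)
                            and not c.endswith('.id')])
```

When you later need full team details for a game, a merge on the id column brings them back in.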

Python bson: How to create list of ObjectIds in a New Column

I have a CSV for which I need to create a column of random, unique MongoDB ObjectIds in Python.
Here's my csv file:
import pandas as pd
df = pd.read_csv('file.csv', sep=';')
print(df)
Zone
Zone_1
Zone_2
I'm currently using this code to generate a unique ObjectId:
import bson
x = bson.objectid.ObjectId()
df['objectids'] = x
print(df)
Zone; objectids
Zone_1; 5bce2e42f6738f20cc12518d
Zone_2; 5bce2e42f6738f20cc12518d
How can I get the ObjectId to be unique for each row?
The referenced duplicate question has nothing to do with ObjectIds, let alone with adding them (or any other object not native to NumPy or pandas) to a DataFrame.
You probably need to use map.
This assumes "objectids" is not already a column in your frame and "Zone" is:
df['objectids'] = df['Zone'].map(lambda x: bson.objectid.ObjectId())
map is a super helpful (though slow) way to touch every record in your Series, and particularly handy as a first way to hook in external functions.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html
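The key point is that the callable runs once per row, so every row gets a fresh id (the original code created one ObjectId and broadcast it). A self-contained sketch of the same pattern, with uuid.uuid4 standing in for bson.objectid.ObjectId purely so the example runs without pymongo installed:

```python
import uuid

import pandas as pd

df = pd.DataFrame({'Zone': ['Zone_1', 'Zone_2']})
# One fresh id per row; with bson installed you would call
# bson.objectid.ObjectId() here instead of uuid.uuid4().
df['objectids'] = df['Zone'].map(lambda _: str(uuid.uuid4()))
print(df)
```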

Replicating Excel's VLOOKUP in Python Pandas

Would really appreciate some help with the following problem. I intend to use the pandas library to solve it, so I'd appreciate an explanation of how this can be done with pandas if possible.
I want to take the following excel file:
Before
and:
1) convert the 'Before' file into a pandas DataFrame
2) look for the text in the 'Site' column. Where this text appears within the string in the 'Domain' column, return the value in the 'Owner' column under 'Output'.
3) the result should look like the 'After' file. I would like to convert this back into CSV format.
After
So essentially this is similar to an Excel VLOOKUP exercise, except it's not an exact match we're looking for between the 'Site' and 'Domain' columns.
I have already attempted this in Excel, but I'm looking at over 100,000 rows compared against over 1,000 sites, which crashes Excel.
So far I have stored the lookup list in the same file as the list of domains we want to classify with the 'Owner'. If there's a much better way to do this, e.g. storing the lookup list in a separate DataFrame altogether, then that's fine.
Thanks in advance for any help, I really appreciate it.
Colin
I think the OP's question differs somewhat from the solutions linked in the comments, which deal either with exact lookups (map) or with lookups between DataFrames. Here there is a single DataFrame and a partial match to find.
import pandas as pd
import numpy as np
df = pd.read_excel('data.xlsx', sheet_name=0)
df = df.astype(str)
df['Test'] = df.apply(lambda row: row['Site'] in row['Domain'], axis=1)
df['Output'] = np.where(df['Test'], df['Owner'], '')
df
The lambda applies the substring (`in`) test row by row (axis=1), producing a boolean 'Test' column. That boolean then acts as the rule for pulling 'Owner' into 'Output'.
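The same logic on a small, made-up frame (hypothetical domains and owners), including the write back to CSV from step 3:

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the 'Before' spreadsheet.
df = pd.DataFrame({
    'Domain': ['shop.example.com', 'blog.other.org', 'mail.example.com'],
    'Site':   ['example.com', 'other.org', 'nomatch.net'],
    'Owner':  ['Alice', 'Bob', 'Carol'],
})
# Substring test per row, then conditional copy of 'Owner'.
df['Output'] = np.where(
    df.apply(lambda row: row['Site'] in row['Domain'], axis=1),
    df['Owner'], '')
df.to_csv('after.csv', index=False)
```

At 100,000 rows apply with a Python lambda will be noticeably slower than vectorized string methods, but unlike Excel it will finish; if it becomes a bottleneck, a list comprehension over zip(df['Site'], df['Domain']) is a cheap speedup.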
