I have different sources (CSV) of similar data sets which I want to merge into a single data set and write to my DB. Since the data comes from different sources, they use different headers in their CSVs; I want to merge these columns by their logical meaning.
So far I have tried reading all the headers first and re-reading the files to get all the data into a single data frame, then doing if/else to merge the columns with the same meaning. Ideally I would like to create a mapping file with all possible column names per column and then read the CSVs using that mapping. The data is not ordered or sorted between files. The number of columns might differ too, but they all contain the columns I am interested in.
Sample data:
File 1:
id, name, total_amount...
1, "test", 123 ..
File 2:
member_id, tot_amnt, name
2, "test2", 1234 ..
I want this to look like:
id, name, total_amount...
1, "test", 123...
2, "test2", 1234...
...
I can't think of an elegant way to do this; it would be great to get some direction or help with this.
Thanks
Use skiprows and header=None to skip the header, names to specify your own list of column names, and concat to merge into a single df, i.e.
import pandas as pd

pd.concat([
    pd.read_csv('file1.csv', skiprows=1, header=None, names=['a', 'b', 'c']),
    pd.read_csv('file2.csv', skiprows=1, header=None, names=['a', 'b', 'c'])
])
Edit: If the different files differ only in column order, you can specify different column orders to names, and if you want to select a subset of columns, use usecols. But you need to do this mapping in advance, either by probing the file or by some other rule.
This requires mapping files to handlers somehow, i.e.
file1.csv
id, name, total_amount
1, "test", 123
file2.csv
member_id, tot_amnt, ignore, name
2, 1234, -1, "test2"
The following selects the three common columns and renames/reorders them:
import pandas as pd

pd.concat([
    pd.read_csv('file1.csv', skiprows=1, header=None, names=['id', 'name', 'value'], usecols=[0, 1, 2]),
    pd.read_csv('file2.csv', skiprows=1, header=None, names=['id', 'value', 'name'], usecols=[0, 1, 3])],
    sort=False
)
Edit 2:
And a nice way to apply this is to use lambdas and a mapping, i.e.
parsers = {
    "schema1": lambda f: pd.read_csv(f, skiprows=1, header=None, names=['id', 'name', 'value'], usecols=[0, 1, 2]),
    "schema2": lambda f: pd.read_csv(f, skiprows=1, header=None, names=['id', 'value', 'name'], usecols=[0, 1, 3])
}

# avoid naming this dict 'map', which would shadow the built-in
schema_map = {
    "file2.csv": "schema2",
    "file1.csv": "schema1"}

pd.concat([parsers[v](k) for k, v in schema_map.items()], sort=False)
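The "probing" route mentioned in the first edit can also be sketched: a minimal version that peeks at each file's header row to pick a parser (the header test below is an assumption based on the sample files above):

import csv

def detect_schema(path):
    # peek at the header row only, then match it against the known schemas
    with open(path, newline='') as f:
        header = [h.strip() for h in next(csv.reader(f))]
    return 'schema2' if 'member_id' in header else 'schema1'

pd.concat([parsers[detect_schema(f)](f) for f in ['file1.csv', 'file2.csv']], sort=False)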
This is what I ended up doing and found to be the cleanest solution. Thanks David for your help.
dict1 = {'member_number': 'id', 'full name': 'name', …}
dict2 = {'member_id': 'id', 'name': 'name', …}

parsers = {
    "schema1": lambda f, d: pd.read_csv(f, index_col=False, usecols=list(d.keys())),
    "schema2": lambda f, d: pd.read_csv(f, index_col=False, usecols=list(d.keys()))
}

schema_map = {
    'schema1': ('a_file.csv', dict1),
    'schema2': ('b_file.csv', dict2)
}

total = []
for k, v in schema_map.items():
    d = parsers[k](v[0], v[1])
    d.rename(columns=v[1], inplace=True)
    total.append(d)

final_df = pd.concat(total, sort=False)
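Since the two lambdas in parsers are identical, a single helper would do the same job; a minimal refactor of the solution above, under the same assumptions:

def read_mapped(path, colmap):
    # read only the source columns named in the mapping, then rename them to the canonical names
    df = pd.read_csv(path, index_col=False, usecols=list(colmap.keys()))
    return df.rename(columns=colmap)

final_df = pd.concat(
    [read_mapped(path, colmap) for path, colmap in [('a_file.csv', dict1), ('b_file.csv', dict2)]],
    sort=False
)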
Task:
I am trying to merge the 'url' column into one row wherever the same name appears in the other column ('Full Path'), using Python and a Jupyter notebook.
I have tried using groupby, but it doesn't give me the result I want.
Code:
df.groupby("Full Path").apply(lambda x: ", ".join(x)).reset_index()
This is not what I am expecting.
The reason it is not working is that you need to modify the 'Full Path' column before passing it to groupby, since there are differences in the full paths.
Based on the sample, the following should work:
df['Full Path'] = df['Full Path'].str.split('/').str[0:2].str.join('/')
test = df.groupby(by=['Full Path']).agg({'url': ', Next'.join})
test['url'] = test['url'].str.replace("Next","\n")
This code of course assumes that the grouping you want for the full path occurs in the first two path segments. The \n will disappear (render as a line break rather than literal text) when you write the df out to Excel.
NOTE: Unless the Type and Date fields all hold the same value within a group, you cannot include them in the groupby: if you did groupby(['Full Path', 'Type', 'Date']), for example, you would end up with not all the links being aggregated for an individual path+folder combination. If you want them included as comma-separated, line-broken columns like url, add them to the agg statement and apply the replace to them as well, as sketched below.
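If you do want Type and Date carried along the same way, a sketch of the extended agg (assuming the column names from the test data below):

test = df.groupby(by=['Full Path']).agg({
    'url': ', Next'.join,
    'Type': ', Next'.join,
    'Date': ', Next'.join,
})
for col in ['url', 'Type', 'Date']:
    test[col] = test[col].str.replace('Next', '\n')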
Code used for testing:
import pandas as pd
pd.options.display.max_colwidth = 999
data_dict = {
'Full Path': [
'downloads/Residences Singapore',
'downloads/Residences Singapore/15234523524352',
'downloads/Residences Singapore/41242341324',
],
'Type': [
'Folder',
'File',
'File',
],
'Date': [
'07-05-22 19:24',
'07-05-22 19:24',
'07-05-22 19:24',
],
'url': [
'https://www.google.com/drive/storage/345243534534522345',
'https://www.google.com/drive/storage/523405923405672340567834589065',
'https://www.google.com/drive/storage/90658360945869012141234',
],
}
df = pd.DataFrame(data_dict)
df['Full Path'] = df['Full Path'].str.split('/').str[0:2].str.join('/')
test = df.groupby(by=['Full Path']).agg({'url': ', Next'.join})
test['url'] = test['url'].str.replace("Next","\n")
test
Output: a single row for 'downloads/Residences Singapore', with the three URLs joined by commas and line breaks.
Just group by the 'Full Path' column with url as the value field, and aggregate with a comma separator.
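A minimal sketch of that approach, using the column names from the sample above:

out = df.groupby('Full Path', as_index=False).agg({'url': ', '.join})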
A pandas newbie here, struggling to understand why I'm unable to completely flatten a JSON I receive from an API. I need a DataFrame with all the data returned by the API, but I need all nested data expanded into its own columns to be able to use it.
The JSON I receive is as follows:
[
{
"query":{
"id":"1596487766859-3594dfce3973bc19",
"name":"test"
},
"webPage":{
"inLanguages":[
{
"code":"en"
}
]
},
"product":{
"name":"Test",
"description":"Test2",
"mainImage":"image1.jpg",
"images":[
"image2.jpg",
"image3.jpg"
],
"offers":[
{
"price":"45.0",
"currency":"€"
}
],
"probability":0.9552192
}
}
]
Running pd.json_normalize(data) without any additional parameters leaves the nested values price and currency inside the product.offers column. When I try to separate these out into their own columns with the following:
pd.json_normalize(data,record_path=['product',meta['product',['offers']]])
I end up with the following error:
f"{js} has non list value {result} for path {spec}. "
Any help would be much appreciated.
I've used this technique a few times:
- do an initial pd.json_normalize() to discover the columns
- build the meta parameter by inspecting this and the original JSON (NB: possible index out of range here)
- you can only request one list to drive the record_path param
- a few tricks: product/images is a list, so its column gets named 0; rename it
- do a Cartesian product to merge the two data frames that come from breaking down the two lists (it's not so stable)
data = [{'query': {'id': '1596487766859-3594dfce3973bc19', 'name': 'test'},
'webPage': {'inLanguages': [{'code': 'en'}]},
'product': {'name': 'Test',
'description': 'Test2',
'mainImage': 'image1.jpg',
'images': ['image2.jpg', 'image3.jpg'],
'offers': [{'price': '45.0', 'currency': '€'}],
'probability': 0.9552192}}]
# build default to get column names
df = pd.json_normalize(data)
# from the column names, build the list that gets sent to the meta param
mymeta = [c.split(".") for c in df.columns]
# exclude list-valued paths from meta - json_normalize would fail on those
mymeta = [l for l in mymeta if not isinstance(data[0][l[0]][l[1]], list)]
# you can build df from either of the product lists NOT both
df1 = pd.json_normalize(data, record_path=[["product","offers"]], meta=mymeta)
df2 = pd.json_normalize(data, record_path=[["product","images"]], meta=mymeta).rename(columns={0:"image"})
# want them together - you can merge them. note columns heavily overlap so remove most columns from df2
df1.assign(foo=1).merge(
df2.assign(foo=1).drop(columns=[c for c in df2.columns if c!="image"]), on="foo").drop(columns="foo")
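On pandas 1.2 and later the dummy foo key is not needed; an explicit cross join gives the same Cartesian product:

df1.merge(df2[["image"]], how="cross")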
I have a column in a pandas data frame that contains strings in the following format, for example:
fullyRandom=true+mapSizeDividedBy64=51048
mapSizeDividedBy16000=9756+fullyRandom=false
qType=MpmcArrayQueue+qCapacity=822398+burstSize=664
count=11087+mySeed=2+maxLength=9490
capacity=27281
capacity=79882
We can read the first row, for example, as 2 parameters separated by '+'; each parameter has a value, and '=' separates the parameter from its value.
As output, I'm asking if there is a Python script that either extracts the parameters and returns a list of the unique parameters, like the following:
[fullyRandom, mapSizeDividedBy64, mapSizeDividedBy16000, qType, qCapacity, burstSize, count, mySeed, maxLength, capacity]
Notice that the previous list contains only the unique parameters, without their values.
Or, if it's not too difficult, an extended pandas data frame: parse the column and convert it into many columns, one per parameter, each storing that parameter's value.
Try this; it will store the values in a list.
data = []
with open('<your text file>', 'r') as file:
    content = file.readlines()
for row in content:
    if '+' in row:
        sub_row = row.strip('\n').split('+')
        for r in sub_row:
            data.append(r)
    else:
        data.append(row.strip('\n'))
print(data)
Output:
['fullyRandom=true', 'mapSizeDividedBy64=51048', 'mapSizeDividedBy16000=9756', 'fullyRandom=false', 'qType=MpmcArrayQueue', 'qCapacity=822398', 'burstSize=664', 'count=11087', 'mySeed=2', 'maxLength=9490', 'capacity=27281', 'capacity=79882']
To convert to a list of dicts that can be used in pandas:
dict_list = []
for item in data:
    entry = {
        item.split('=')[0]: item.split('=')[1]
    }
    dict_list.append(entry)
print(dict_list)
Output:
[{'fullyRandom': 'true'}, {'mapSizeDividedBy64': '51048'}, {'mapSizeDividedBy16000': '9756'}, {'fullyRandom': 'false'}, {'qType': 'MpmcArrayQueue'}, {'qCapacity': '822398'}, {'burstSize': '664'}, {'count': '11087'}, {'mySeed': '2'}, {'maxLength': '9490'}, {'capacity': '27281'}, {'capacity': '79882'}]
To just get the parameter names, append only the key inside the same loop:
dict_list.append(item.split('=')[0])
Output:
['fullyRandom', 'mapSizeDividedBy64', 'mapSizeDividedBy16000', 'fullyRandom', 'qType', 'qCapacity', 'burstSize', 'count', 'mySeed', 'maxLength', 'capacity', 'capacity']
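Since the question mentions pandas, here is a sketch of getting both outputs directly from a DataFrame column (the Series s stands in for your column; the names are assumptions):

import pandas as pd

s = pd.Series([
    'fullyRandom=true+mapSizeDividedBy64=51048',
    'mapSizeDividedBy16000=9756+fullyRandom=false',
    'capacity=27281',
])

# unique parameter names: one 'param=value' pair per row, keep the part before '='
pairs = s.str.split('+').explode().str.split('=', n=1, expand=True)
unique_params = pairs[0].unique().tolist()

# wide form: one column per parameter, one row per original string (missing params become NaN)
wide = pd.DataFrame(dict(p.split('=', 1) for p in line.split('+')) for line in s)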
My goal here is to clean up address data from individual CSV files, using a dictionary for each individual column, sort of like automating the find-and-replace feature from Excel. The addresses are divided into columns: house numbers, street names, directions and street types, each in their own column. I used the following code to process the whole document.
missad = {
    'Typo goes here': 'Corrected typo goes here'}

def replace_all(text, dic):
    for i, j in dic.items():
        text = text.replace(i, j)
    return text

with open('original.csv', 'r') as csvfile:
    text = csvfile.read()
text = replace_all(text, missad)
with open('cleanfile.csv', 'w') as cleancsv:
    cleancsv.write(text)
While the code works, I need separate dictionaries, as some columns need column-specific typo fixes. For example, housenum for the house-number column, stdir for the street direction, and so on, each with its own column-specific typos:
housenum = {
    'One': '1',
    'Two': '2'
}
stdir = {
    'NULL': ''}
I have no idea how to proceed; I feel it's something simple, or that I would need pandas, but I'm unsure how to continue. Would appreciate any help! Also, is there any way to group several typos together with one corrected value? I tried the following but got an unhashable type error.
missad = {
    ['Typo goes here', 'Typo 2 goes here', 'Typo 3 goes here']: 'Corrected typo goes here'}
Is something like this what you are looking for?
import pandas as pd

df = pd.read_csv(filename, index_col=False)  # using pandas to read in the CSV file
# let's say in this dataframe you want to do corrections on the 'columnforcorrection' column
correctiondict = {
    'one': 1,
    'two': 2
}
df['columnforcorrection'] = df['columnforcorrection'].replace(correctiondict)
and use this idea for the other columns of interest; a sketch of applying a separate dictionary per column follows.
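df.replace also accepts a nested {column: {old: new}} mapping, which applies a different dictionary to each column in one call; and the "several typos, one correction" case can be written as a dict comprehension. A sketch using the dictionaries from the question:

column_fixes = {
    'housenum': {'One': '1', 'Two': '2'},
    'stdir': {'NULL': ''}
}
df = df.replace(column_fixes)

# many typos mapping to one corrected value: expand the list into dict keys
typos = ['Typo goes here', 'Typo 2 goes here', 'Typo 3 goes here']
missad = {typo: 'Corrected typo goes here' for typo in typos}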
I've tried to follow a bunch of answers I've seen on SO, but I'm really stuck here. I'm trying to convert a CSV to JSON.
The JSON schema has multiple levels of nesting and some of the values in the CSV will be shared.
Here's a link to one record in the CSV.
Think of this sample as two different parties attached to one document.
The fields on the document (document_source_id, document_amount, record_date, source_url, document_file_url, document_type__title, apn, situs_county_id, state_code) should not duplicate.
While the fields of each entity are unique.
I've tried to nest these using a complex groupby statement, but am stuck getting the data into my schema.
Here's what I've tried. It doesn't contain all fields because I'm having a difficult time understanding what it all means.
j = (df.groupby(['state_code',
'record_date',
'situs_county_id',
'document_type__title',
'document_file_url',
'document_amount',
'source_url'], as_index=False)
.apply(lambda x: x[['source_url']].to_dict('r'))
.reset_index()
.rename(columns={0:'metadata', 1:'parcels'})
.to_json(orient='records'))
Here's how the sample CSV should look after conversion:
{
"metadata":{
"source_url":"https://a836-acris.nyc.gov/DS/DocumentSearch/DocumentDetail?doc_id=2019012901225004",
"document_file_url":"https://a836-acris.nyc.gov/DS/DocumentSearch/DocumentImageView?doc_id=2019012901225004"
},
"state_code":"NY",
"nested_data":{
"parcels":[
{
"apn":"3972-61",
"situs_county_id":"36005"
}
],
"participants":[
{
"entity":{
"name":"5 AIF WILLOW, LLC",
"situs_street":"19800 MACARTHUR BLVD",
"situs_city":"IRVINE",
"situs_unit":"SUITE 1150",
"state_code":"CA",
"situs_zip":"92612"
},
"participation_type":"Grantee"
},
{
"entity":{
"name":"5 ARCH INCOME FUND 2, LLC",
"situs_street":"19800 MACARTHUR BLVD",
"situs_city":"IRVINE",
"situs_unit":"SUITE 1150",
"state_code":"CA",
"situs_zip":"92612"
},
"participation_type":"Grantor"
}
]
},
"record_date":"01/31/2019",
"situs_county_id":"36005",
"document_source_id":"2019012901225004",
"document_type__title":"ASSIGNMENT, MORTGAGE"
}
You might need to use the json_normalize function from pandas.io.json
from pandas.io.json import json_normalize
import csv

li = []
with open('filename.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        li.append(row)
df = json_normalize(li)
Here, we create a list of dictionaries from the CSV file and build a DataFrame from it with json_normalize.
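For a flat CSV this round-trip is equivalent to a plain pd.read_csv; note also that since pandas 1.0, json_normalize is exposed at the top level rather than under pandas.io.json:

import pandas as pd

df = pd.read_csv('filename.csv')  # same flat DataFrame for a plain CSV
df = pd.json_normalize(li)        # modern import path for the same call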
Below is one way to export your data:
# all columns used in groupby()
grouped_cols = ['state_code', 'record_date', 'situs_county_id', 'document_source_id'
, 'document_type__title', 'source_url', 'document_file_url']
# adjust some column names to map to those in the 'entity' node in the desired JSON
situs_mapping = {
'street_number_street_name': 'situs_street'
, 'city_name': 'situs_city'
, 'unit': 'situs_unit'
, 'state_code': 'state_code'
, 'zipcode_full': 'situs_zip'
}
# define the columns used for the 'entity' node (Python 2 needs the alternative syntax below)
entity_cols = ['name', *situs_mapping.values()]
# below for Python 2:
# entity_cols = ['name'] + list(situs_mapping.values())
# specify output fields
output_cols = ['metadata','state_code','nested_data','record_date'
, 'situs_county_id', 'document_source_id', 'document_type__title']
# define a function to get nested_data
def get_nested_data(d):
    return {
        'parcels': d[['apn', 'situs_county_id']].drop_duplicates().to_dict('records')
        , 'participants': d[['entity', 'participation_type']].to_dict('records')
    }
j = (df.rename(columns=situs_mapping)
       .assign(entity=lambda x: x[entity_cols].to_dict('records'))
       .groupby(grouped_cols)
       .apply(get_nested_data)
       .reset_index()
       .rename(columns={0: 'nested_data'})
       .assign(metadata=lambda x: x[['source_url', 'document_file_url']].to_dict('records'))[output_cols]
       .to_json(orient="records")
)
print(j)
Note: if participants contains duplicates and you must run drop_duplicates() on it as we do on parcels, then the assign(entity) step can be moved into the definition of participants inside the get_nested_data() function:
, 'participants': d[['participation_type', *entity_cols]] \
    .drop_duplicates() \
    .assign(entity=lambda x: x[entity_cols].to_dict('records')) \
    .loc[:, ['entity', 'participation_type']] \
    .to_dict('records')