df4 = []
for i in (my_data.points.values.tolist()[0]):
    df3 = pd.json_normalize(i)
    df4.append(df3)
df5 = pd.DataFrame(df4)
df5.head()
When I run this code I get this error: Must pass 2-d input. shape=(16001, 1, 3)
pd.json_normalize converts JSON data to table format, returning a DataFrame on each call. Your df4 therefore ends up as a list of DataFrames, which pd.DataFrame treats as 3-d input, hence the error. To build a DataFrame directly, what you need is a list of dictionaries.
For example
dict_list = [
    {"id": 1, "name": "apple", "price": 10},
    {"id": 2, "name": "orange", "price": 20},
    {"id": 3, "name": "pineapple", "price": 15},
]
df = pd.DataFrame(dict_list)
In your case
df4 = []
for i in (my_data.points.values.tolist()[0]):
    # df3 = pd.json_normalize(i) -- since the structure is not mentioned,
    # I'm assuming "i" is a dictionary which holds the relevant row
    df4.append(i)
df5 = pd.DataFrame(df4)
df5.head()
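If the rows really are nested JSON and do need flattening, an alternative sketch (assuming my_data.points.values.tolist()[0] is a plain list of dicts) is to normalize the whole list in one call, or to concatenate the per-row frames instead of wrapping the list in pd.DataFrame:
import pandas as pd
records = my_data.points.values.tolist()[0]
# json_normalize accepts a list of dicts and flattens all records in one call
df5 = pd.json_normalize(records)
# Equivalently, if you already have the list df4 of one-row frames:
# df5 = pd.concat(df4, ignore_index=True)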
I'd like to insert empty rows into my dataframe at regular intervals.
I have the following code, which does roughly what I want, but I'm struggling to adjust it to my needs:
s = pd.Series('', data_only_trades.columns)
f = lambda d: d.append(s, ignore_index=True)
set_rows = np.arange(len(data_only_trades)) // 4
empty_rows = data_only_trades.groupby(set_rows, group_keys=False).apply(f).reset_index(drop=True)
How can I adjust the code so that it adds two or more rows instead of one?
How can I set a starting point (e.g. it should start at row 5 -- do I have to use .loc in arange then)?
I also tried the code below, but I struggled to set the starting row and to make the values blank (I got NaN):
df_new = pd.DataFrame()
for i, row in data_only_trades.iterrows():
    df_new = df_new.append(row)
    for _ in range(2):
        df_new = df_new.append(pd.Series(), ignore_index=True)
Thank you!
I think you can use NumPy to stack the blank rows onto the frame:
import numpy as np
v = np.full((numberOfRowsYouWant, df.shape[1]), "", dtype=object)
pd.DataFrame(np.vstack((df.values, v)))
But if you want to keep your own approach, simply convert NaN to "" (note that fillna returns a copy):
df = df.fillna("")
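For the original question (two or more blank rows, plus a configurable starting row), here is a minimal sketch that avoids DataFrame.append, which was removed in pandas 2.0; insert_blank_rows is a hypothetical helper name, and every/n_blank/start are made-up parameters:
import numpy as np
import pandas as pd
def insert_blank_rows(df, every=4, n_blank=2, start=0):
    # Rows before `start` form one leading chunk (group -1);
    # after that, a new chunk begins every `every` rows
    groups = np.maximum(np.arange(len(df)) - start, -1) // every
    blank = pd.DataFrame("", index=range(n_blank), columns=df.columns)
    parts = []
    for _, chunk in df.groupby(groups):
        parts.extend([chunk, blank])
    # Drop the trailing blank block after the last chunk
    return pd.concat(parts[:-1], ignore_index=True)
# e.g. result = insert_blank_rows(data_only_trades, every=4, n_blank=2, start=5)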
I have a loop within a nested loop that generates 6 dictionaries at the end. Each dictionary has the same keys but different values. At the end of every iteration I would like to append the dictionary to the same dataframe, but it keeps failing.
At the end I would like to have a table with 6 columns plus an index which holds the keys.
This is the idea behind what I'm trying to do:
dictionary = dict()
for i in blahh:
    dictionary[i] = dict(zip(blahh['x'][i], blahh['y'][i]))
    df = pd.DataFrame(dictionary)
    df_final = pd.concat([dictionary, df])
I get the error:
cannot concatenate object of type '<class 'dict'>'; only series and dataframe objs are valid
I created a practice dataset, if it's necessary:
letts = [('a','b','c'), ('e','f','g'), ('h','i','j'), ('k','l','m'), ('n','o','p')]
numns = [(1,2,3), (4,5,6), (7,8,9), (10,11,12), (13,14,15)]
dictionary = dict()
for i in letts:
    for j in numns:
        dictionary = dict(zip(i, j))
I'm confused by your practice dataset, but the modifications below could give you an idea...
df_final = pd.DataFrame()
dictionary = dict()
for i in blahh:
    dictionary[i] = dict(zip(blahh['x'][i], blahh['y'][i]))
    df = pd.DataFrame(dictionary)  # if the values were scalars, an index would have to be passed
    df_final = pd.concat([df_final, df])
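For the final shape described in the question (6 columns plus an index holding the shared keys), here is a minimal runnable sketch; the inner dicts and the column names col0..col5 are made up, assuming every dict really does share the same keys:
import pandas as pd
# Six dicts with identical keys, standing in for the ones the nested loop produces
dicts = [{'x': n, 'y': n * 10, 'z': n * 100} for n in range(6)]
# One column per dict; the shared keys 'x', 'y', 'z' become the index
df_final = pd.DataFrame({f'col{n}': d for n, d in enumerate(dicts)})
print(df_final)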
I'm looking for a way to compare partial numeric values between columns from different dataframes. These columns are filled with something like social security numbers (they can't and won't repeat), so something like a dynamic isin() would be ideal.
These are representations of very large dataframes that I import from CSV files.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261", "215446", "205303", "973637", "814452", "399304", "404205"]})
df2 = pd.DataFrame({"Id_number": ["14452", "9930", "1544", "5303", "973637", "4205", "0271600", "342964", "763", "60078"]})
print(df1)
print(df2)
df2['Id_number_length'] = df2['Id_number'].str.len()
df2.groupby('Id_number_length').count()
count_list = df2.groupby('Id_number_length')[['Id_number_length']].count()
print('count_list:\n', count_list)
df1['S_number'] = pd.to_numeric(df1['S_number'], downcast='integer')
df2['Id_number'] = pd.to_numeric(df2['Id_number'], downcast='integer')
inner_join = pd.merge(df1, df2, left_on='S_number', right_on='Id_number', how='inner')
print('MATCH!:\n', inner_join)
outer_join = pd.merge(df1, df2, left_on='S_number', right_on='Id_number', how='outer', indicator=True)
anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis=1)
print('UNMATCHED:\n', anti_join)
What I need to get is something like the following, as the result of the inner join or whatever method works:
df3 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261", "215446", "205303", "973637", "814452", "399304", "404205"],
                    "Id_number": ["027160", "60078", "342964", "763", "1544", "5303", "973637", "14452", "9930", "4205"]})
print('MATCH!:\n', df3)
I thought that something like this (very crude) pseudocode would work, using count_list to strip parts of the numbers in df1 so they fully match df2 instead of partially matching (notice that in df2 the missing or added digits are always at the beginning or the end):
for i in count_list:
    if i == 6:
        try inner join
        except empty output
    elif i == 5:
        try
            df1.loc[:, 'S_number'] = df_ib_c.loc[:, 'S_number'].str[1:]
            inner join with df2
        except empty output
        try
            df1.loc[:, 'S_number'] = df_ib_c.loc[:, 'S_number'].str[:-1]
            inner join with df2
    elif i == 4:
        same as above...
But the lengths in count_list are variable, so this for loop is an inefficient way to do it.
Any help with this will be very much appreciated; I've been stuck on this for days. Thanks in advance.
You can 'explode' each line of df1 into up to 45 lines. For example, SSN 123456789 can be mapped to [1,2,3...9,12,23,34,45..89,...12345678,23456789,123456789] (a 9-digit string has 9*10/2 = 45 contiguous substrings). While this looks bad, from an algorithm standpoint it is O(1) per row and therefore O(N) in total.
Using this new column as the key, a simple merge can then combine the two DataFrames easily, which is usually O(N log N).
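A minimal sketch of this explode-and-merge idea, on shortened versions of the df1/df2 from the question; the substrings helper and the leading-zero stripping are my additions, so that cases like "0271600" vs "271600" can match:
import pandas as pd
df1 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261"]})
df2 = pd.DataFrame({"Id_number": ["60078", "0271600", "342964", "763"]})
def substrings(s):
    # All contiguous substrings of s: n*(n+1)/2 of them, i.e. 45 for 9 digits
    return [s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)]
exploded = df1.assign(key=df1["S_number"].map(substrings)).explode("key")
# Strip leading zeros on the df2 side so "0271600" can match "271600"
matches = (exploded
           .merge(df2.assign(key=df2["Id_number"].str.lstrip("0")), on="key")
           [["S_number", "Id_number"]]
           .drop_duplicates())
print(matches)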
Here is an example of what I would do. I hope I've understood correctly. Feel free to ask if it's not clear.
import pandas as pd
import joblib
from joblib import Parallel, delayed
# Building the base
df1 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261", "215446", "205303", "973637", "814452", "399304", "404205"]})
df2 = pd.DataFrame({"Id_number": ["14452", "9930", "1544", "5303", "973637", "4205", "0271600", "342964", "763", "60078"]})
# Initiate an empty list for the indexes
IDX = []
# Using a function so it can be parallelized if the database is big
def func(x, y):
    if all(c in df2.Id_number[y] for c in df1.S_number[x]):
        return (x, y)
# Using the max number of processors
number_of_cpu = joblib.cpu_count()
# Preparing a delayed function to be parallelized
delayed_funcs = (delayed(func)(x, y) for x in range(len(df1)) for y in range(len(df2)))
# Fitting it with processes, not threads
parallel_pool = Parallel(n_jobs=number_of_cpu, prefer="processes")
# Filling the IDX list
IDX.append(parallel_pool(delayed_funcs))
# Dropping the Nones
IDX = list(filter(None, IDX[0]))
# Making df3 with the tuples of indexes
df3 = pd.DataFrame(IDX)
# Making it readable
df3['df1'] = df1.S_number[df3[0]].to_list()
df3['df2'] = df2.Id_number[df3[1]].to_list()
df3
I am really struggling to make it work...
How can I get a Series, transform it to a dataframe, add a column to it, and concatenate it in a loop?
The pseudocode is below, but the correct syntax is a mystery to me:
def func_B_Column(df):
    return 1

df_1 = (...)  # columns=['a', 'etc1', 'etc2']
df_2 = pandas.DataFrame(columns=['a', 'b', 'c'])
listOfColumnC = ['c1', 'c2', 'c3']
for var in listOfColumnC:
    series = df_1.groupby('a').apply(func_B_Column)  # the Series should now have 'a' as index and func_B_Column's result as value
    aux = series.to_frame('b')
    aux['c'] = var  # add another column 'c' to the frame
    df_2 = df_2.append(aux)  # concatenate the results as rows at the end
Edited after the question's refinement
df_2 = pandas.DataFrame()
for var in listOfColumnC:
    df_2 = df_2.append(pandas.DataFrame({'b': df_1.groupby('a').apply(func_B_Column), 'c': var}))
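Note that DataFrame.append was removed in pandas 2.0; a concat-based equivalent of the same loop, a sketch under the same assumptions as above:
import pandas
frames = [
    pandas.DataFrame({'b': df_1.groupby('a').apply(func_B_Column), 'c': var})
    for var in listOfColumnC
]
df_2 = pandas.concat(frames)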