I have a big dataframe of around 30000 rows and a single column containing a JSON string. Each JSON string contains a number of variables and their values. I want to break this JSON string down into columns of data.
Two rows look like:
0 {"a":"1","b":"2","c":"3"}
1 {"a" :"4","b":"5","c":"6"}
I want to convert this into a dataframe like
a b c
1 2 3
4 5 6
Please help
Your column values seem to have an extra number before the actual JSON string, so you might want to strip that out first (skip to Method if that isn't the case).
One way to do that is to apply a function to the column:
import re
import pandas as pd

# constructing the df
df = pd.DataFrame([['0 {"a":"1","b":"2","c":"3"}'],['1 {"a" :"4","b":"5","c":"6"}']], columns=['json'])
# print(df)
#                            json
# 0  0 {"a":"1","b":"2","c":"3"}
# 1  1 {"a" :"4","b":"5","c":"6"}

# function to remove the number: keep everything from the first "{" onward
def split_num(val):
    p = re.compile("({.*)")
    return p.search(val).group(1)

# applying the function
df['json'] = df['json'].map(split_num)
print(df)
#                          json
# 0  {"a":"1","b":"2","c":"3"}
# 1  {"a" :"4","b":"5","c":"6"}
Method:
Once the df is in the above format, the below will convert each row entry to a dictionary:
df['json'] = df['json'].map(lambda x: dict(eval(x)))
Then, applying pd.Series to the column will do the job
d = df['json'].apply(pd.Series)
print(d)
# a b c
# 0 1 2 3
# 1 4 5 6
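Since `eval` executes arbitrary code and does not understand JSON literals such as `true`, `false`, or `null`, parsing with the standard `json` module may be a safer choice. The following is a minimal sketch, assuming the column already holds clean JSON strings; the names `parsed` and `wide` are only illustrative.

```python
import json
import pandas as pd

# Two sample rows, already stripped of the leading number
df = pd.DataFrame({"json": ['{"a":"1","b":"2","c":"3"}',
                            '{"a" :"4","b":"5","c":"6"}']})

# Parse each string into a dict without executing it as Python code
parsed = df["json"].map(json.loads)

# Build one column per key, keeping the original row index
wide = pd.DataFrame(parsed.tolist(), index=df.index)
print(wide)
```

Building the frame from a list of dicts in one call is usually faster on tens of thousands of rows than applying `pd.Series` row by row.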
import json
import pandas as pd

# json_file: path to a file holding one JSON object per line
with open(json_file) as f:
    df = pd.DataFrame(json.loads(line) for line in f)
I have a pandas dataframe in Python, and I want to remove the rows that contain letters in a certain column. I have tried a few things, but nothing has worked.
Input:
A B C
0 9 1 a
1 8 2 b
2 7 cat c
3 6 4 d
I would then remove rows that contained letters in column 'B'...
Expected Output:
A B C
0 9 1 a
1 8 2 b
3 6 4 d
Update:
After seeing the replies, I still haven't been able to get this to work. I'm going to just place my entire code here. Maybe I'm not understanding something...
import pandas as pd
#takes file path from user and removes quotation marks if necessary
sysco1file = input("Input path of FS1 file: ").replace("\"","")
sysco2file = input("Input path of FS2 file: ").replace("\"","")
sysco3file = input("Input path of FS3 file: ").replace("\"","")
#tab separated files, all values string
sysco_1 = pd.read_csv(sysco1file, sep='\t', dtype=str)
sysco_2 = pd.read_csv(sysco2file, sep='\t', dtype=str)
sysco_3 = pd.read_csv(sysco3file, sep='\t', dtype=str)
#combine all rows from the 3 files into one dataframe
sysco_all = pd.concat([sysco_1,sysco_2,sysco_3])
#Also dropping nulls from CompAcctNum column
sysco_all.dropna(subset=['CompAcctNum'], inplace=True)
#ensure all values are string
sysco_all = sysco_all.astype(str)
#implemented solution from stackoverflow
#I also tried putting "sysco_all = " in front of this
sysco_all.loc[~sysco_all['CompanyNumber'].str.isalpha()]
#writing dataframe to new csv file
sysco_all.to_csv(r"C:\Users\user\Desktop\testcsvfile.csv")
I do not get an error. However, the csv still has rows with letters in this column.
Assuming the B column is of string type, we can use str.contains here:
df[~df["B"].str.contains(r'^[A-Za-z]+$', regex=True)]
Here is another way to do it:
# use isalpha to check if value is alphabetic
# use negation to pick where value is not alphabetic
df=df.loc[~df['B'].str.isalpha()]
df
A B C
0 9 1 a
1 8 2 b
3 6 4 d
OR
# output the filtered result to csv, preserving the original DF
df.loc[~df['B'].str.isalpha()].to_csv('out.csv')
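Two details may explain why the code in the update appeared to do nothing: the filtered frame has to be assigned to a name (or written out directly), and `str.isalpha()` is only true when every character is a letter, so a mixed value such as `12a` would slip through. Below is a small sketch, assuming "contains letters" means "contains at least one ASCII letter"; the sample frame mirrors the one in the question and `has_letter` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"A": ["9", "8", "7", "6"],
                   "B": ["1", "2", "cat", "4"],
                   "C": ["a", "b", "c", "d"]})

# True for rows whose B value contains at least one ASCII letter;
# na=False treats missing values as "no letter"
has_letter = df["B"].str.contains(r"[A-Za-z]", regex=True, na=False)

# Filtering returns a new frame; the original df is left unchanged,
# so the result must be assigned before it is written to CSV
filtered = df.loc[~has_letter]
print(filtered)
```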
I have a data frame that contains a column with both strings and lists.
import pandas as pd
data = {'lanes': ['1',['2','4'],'2','3',['1','2','3']]}
df = pd.DataFrame(data,columns=['lanes'])
df
I need to convert the strings to ints and replace the lists with means of the list elements. So, the output should look like this:
data2 = {'lanes': [1,3,2,3,2]}
df2 = pd.DataFrame(data2,columns=['lanes'])
df2
Can anyone give me some direction on how to do this, if you have done something like this before?
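Use Series.explode, convert the values to integers, and then take the mean per duplicated index value with groupby(level=0) (the older mean(level=0) shortcut was removed in pandas 2.0). The final astype(int) casts the float means back to integers:
df['lanes'] = df['lanes'].explode().astype(int).groupby(level=0).mean().astype(int)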
print (df)
lanes
0 1
1 3
2 2
3 3
4 2
If the values are string representations of lists rather than actual lists, use:
data = {'lanes': ['1',"['2','4']",'2','3',"['1','2','3']"]}
df = pd.DataFrame(data,columns=['lanes'])
import ast
df['lanes'] = df['lanes'].apply(ast.literal_eval).explode().astype(int).groupby(level=0).mean().astype(int)
print (df)
lanes
0 1
1 3
2 2
3 3
4 2
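You can try the snippet below as well. It uses list comprehensions to get the result:
import pandas as pd
data = {'lanes': ['1',['2','4'],'2','3',['1','2','3']]}

def mean(values):
    return sum(values) / len(values)

# wrap bare strings in a list so that a multi-digit string such as '12' is not split into characters
as_lists = [item if isinstance(item, list) else [item] for item in data['lanes']]
data2 = dict()
data2['lanes'] = [int(mean([int(x) for x in item])) for item in as_lists]
df2 = pd.DataFrame(data2,columns=['lanes'])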
I have a csv file which is continuously updated with new data. The data has 5 columns, however lately the data have been changed to 4 columns. The column that is not present in later data is the first one. When I try to read this csv file into a dataframe, half of the data is under the wrong columns. The data is around 50k entries.
df
################################
0 time C-3 C-4 C-5
______________________________
0 1 str 4 5 <- old entries
0 1 str 4 5
1 str 4 5 Nan <- new entries
1 str 4 5 Nan
1 str 4 5 Nan
#################################
The first column in the earlier entries (where value = 0) is not important.
My expected output is the dataframe printed with the right values in the right columns. I have no idea how to add the 0 before the str where the 0 is missing, or, the other way around, how to remove the 0 (since the 0 looks to be an index counter with values starting at 1, then 2, etc.).
Here is how I load and process the csv file at the moment:
def process_csv(obj):
    data='some_data'
    path='data/'
    file=f'{obj}_{data}.csv'
    df=pd.read_csv(path+file)
    df=df.rename(columns=df.iloc[0]).drop(df.index[0])
    df = df[df.time != 'time']
    mask = df['time'].str.contains(' ')
    df['time'] = (pd.to_datetime(df.loc[mask,'time'])
                    .reindex(df.index)
                    .fillna(pd.to_datetime(df.loc[~mask, 'time'], unit='ms')))
    df=df.drop_duplicates(subset=['time'])
    df=df.set_index("time")
    df = df.sort_index()
    return df
Since the time column should be of type int, a missing first column causes str values that belong in C-3 to end up in the time column, which causes this error:
ValueError: non convertible value str with the unit 'ms'
Question: How can I either remove the early values from column 0 or add some values to the later entries?
If one CSV file contains multiple formats, and these CSV files cannot be changed, then you could parse the files before creating data frames.
For example, the test data has both 3- and 4-field records. The function gen_read_lines() always returns 3 fields per record.
from io import StringIO
import csv
import pandas as pd
data = '''1,10,11
2,20,21
3,-1,30,31
4,-2,40,41
'''
def gen_read_lines(text):
    # text holds the CSV content itself; for a real file, pass an open file handle to csv.reader instead
    lines = csv.reader(StringIO(text))
    for record in lines:
        if len(record) == 3: # return all fields
            yield record
        elif len(record) == 4: # drop field 1
            yield [record[i] for i in [0, 2, 3]]
        else:
            raise ValueError(f'did not expect {len(record)} fields')

df = pd.DataFrame(gen_read_lines(data), columns=['A', 'B', 'C'])
print(df)
A B C
0 1 10 11
1 2 20 21
2 3 30 31
3 4 40 41
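Applied to the layout in the question, the same idea could look roughly like the sketch below. It assumes that old rows have 5 fields, new rows have 4, and the extra field is always the first one; `normalize_rows` and the sample text are hypothetical, and a real file would be opened with `open(path, newline="")` instead of `StringIO`.

```python
import csv
from io import StringIO
import pandas as pd

def normalize_rows(handle, width=4):
    """Yield rows of exactly `width` fields, dropping the leading field of longer rows."""
    for row in csv.reader(handle):
        if not row:                      # skip blank lines
            continue
        if len(row) == width:            # new format: keep as is
            yield row
        elif len(row) == width + 1:      # old format: drop the unneeded first field
            yield row[1:]
        else:
            raise ValueError(f"unexpected field count: {len(row)}")

# Stand-in for the real file: two old-style rows, then two new-style rows
sample = StringIO("0,1,str,4,5\n0,2,str,4,5\n3,str,4,5\n4,str,4,5\n")
df = pd.DataFrame(normalize_rows(sample), columns=["time", "C-3", "C-4", "C-5"])
print(df)
```

All values arrive as strings, so the time column would still need converting before the datetime handling in `process_csv`.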
Suppose I have a Pandas DataFrame with 6 columns and a custom function that takes counts of the elements in 2 or 3 columns and produces a boolean output. When a groupby object is created from the original dataframe and the custom function is applied df.groupby('col1').apply(myfunc), the result is a series whose length is equal to the number of categories of col1. How do I expand this output to match the length of the original dataframe? I tried transform, but was not able to use the custom function myfunc with it.
EDIT:
Here is an example code:
A = pd.DataFrame({'X':['a','b','c','a','c'], 'Y':['at','bt','ct','at','ct'], 'Z':['q','q','r','r','s']})
print (A)
def myfunc(df):
    return ((df['Z'].nunique()>=2) and (df['Y'].nunique()<2))
A.groupby('X').apply(myfunc)
I would like to expand this output into a new column Result, so that, for example, every row where column X is a gets the value True.
You can map the groupby result back to the original dataframe:
A['Result'] = A['X'].map(A.groupby('X').apply(myfunc))
Result would look like:
X Y Z Result
0 a at q True
1 b bt q False
2 c ct r True
3 a at r True
4 c ct s True
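If `transform` is preferred, one possible workaround is to transform each column separately and combine the per-row results afterwards, since `transform` works on one column at a time. This is only a sketch of the same condition as `myfunc`, not a general way to pass a multi-column function to `transform`; the name `g` is illustrative.

```python
import pandas as pd

A = pd.DataFrame({'X': ['a', 'b', 'c', 'a', 'c'],
                  'Y': ['at', 'bt', 'ct', 'at', 'ct'],
                  'Z': ['q', 'q', 'r', 'r', 's']})

g = A.groupby('X')

# transform('nunique') returns one value per original row,
# so the two conditions can be combined element-wise
A['Result'] = (g['Z'].transform('nunique') >= 2) & (g['Y'].transform('nunique') < 2)
print(A)
```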
My solution uses a loop, so it may not be the best one, but I think it works well enough.
The core idea is that you can traverse all the sub-dataframes (gdf) with for i, gdf in gp, add the result column (c in my example) to each sub-dataframe, and finally concat all the sub-dataframes into one.
Here is an example:
import pandas as pd
df = pd.DataFrame({'a':[1,2,1,2],'b':['a','b','c','d']})
gp = df.groupby('a') # group
s = gp['a'].sum() # apply a func: per-group sum of column a
adf = []
# then create a new dataframe
for i, gdf in gp:
    tdf = gdf.copy()
    tdf.loc[:,'c'] = s.loc[i]
    adf.append(tdf)
pd.concat(adf)
from:
a b
0 1 a
1 2 b
2 1 c
3 2 d
to:
a b c
0 1 a 2
2 1 c 2
1 2 b 4
3 2 d 4
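For a simple aggregate like this sum, the loop can probably be avoided with `transform`, which broadcasts the per-group value back to every row. A minimal sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': ['a', 'b', 'c', 'd']})

# Per-group sum of column a, repeated on every row of that group
df['c'] = df.groupby('a')['a'].transform('sum')
print(df)
```

Unlike the loop version, this keeps the rows in their original order rather than grouping them together.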
I have a problem with adding columns in pandas.
I have a DataFrame of dimension n x k. During processing I will need to add columns of dimension m x 1, where m is in [1, n], but I don't know m in advance.
When I try to do it:
df['Name column'] = data
# type(data) = list
the result is:
AssertionError: Length of values does not match length of index
Can I add columns with different length?
If you use the accepted answer, you'll lose your column names, as shown in the accepted answer's example and described in the documentation (emphasis added):
The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
It looks like column names ('Name column') are meaningful to the original poster.
To keep column names, use pandas.concat, but don't set ignore_index (its default value is False, so you can omit that argument altogether). Continue to use axis=1:
import pandas
# Note these columns have 3 rows of values:
original = pandas.DataFrame({
'Age':[10, 12, 13],
'Gender':['M','F','F']
})
# Note this column has 4 rows of values:
additional = pandas.DataFrame({
'Name': ['Nate A', 'Jessie A', 'Daniel H', 'John D']
})
new = pandas.concat([original, additional], axis=1)
# Identical:
# new = pandas.concat([original, additional], ignore_index=False, axis=1)
print(new.head())
# Age Gender Name
#0 10 M Nate A
#1 12 F Jessie A
#2 13 F Daniel H
#3 NaN NaN John D
Notice how John D does not have an Age or a Gender.
Use concat and pass axis=1 and ignore_index=True:
In [38]:
import numpy as np
df = pd.DataFrame({'a':np.arange(5)})
df1 = pd.DataFrame({'b':np.arange(4)})
print(df1)
df
b
0 0
1 1
2 2
3 3
Out[38]:
a
0 0
1 1
2 2
3 3
4 4
In [39]:
pd.concat([df,df1], ignore_index=True, axis=1)
Out[39]:
0 1
0 0 0
1 1 1
2 2 2
3 3 3
4 4 NaN
We can add lists of different lengths to a DataFrame by padding the shorter ones.
Example:
import pandas as pd
a = [0,1,2,3]
b = [0,1,2,3,4,5,6,7,8,9]
c = [0,1]
Find the length of each list:
la,lb,lc = len(a),len(b),len(c)
# now find the max
max_len = max(la,lb,lc)
Pad each list with empty strings up to the determined max length (lists that already have that length are left unchanged):
if not max_len == la:
    a.extend(['']*(max_len-la))
if not max_len == lb:
    b.extend(['']*(max_len-lb))
if not max_len == lc:
    c.extend(['']*(max_len-lc))
Now all the lists have the same length, so create the dataframe:
pd.DataFrame({'A':a,'B':b,'C':c})
Final output is:
   A  B  C
0  0  0  0
1  1  1  1
2  2  2
3  3  3
4     4
5     5
6     6
7     7
8     8
9     9
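As a possible alternative to padding by hand, wrapping each list in a `pd.Series` lets pandas align the columns on the index and fill the gaps with NaN. This is a sketch with the same three lists; the dict name `columns` is illustrative.

```python
import pandas as pd

columns = {
    'A': [0, 1, 2, 3],
    'B': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    'C': [0, 1],
}

# Each Series gets its own 0..len-1 index; the DataFrame constructor
# takes the union of those indexes and fills missing cells with NaN
df = pd.DataFrame({name: pd.Series(values) for name, values in columns.items()})
print(df)
```

The padded columns become float because of the NaN values, whereas the empty-string padding above turns them into object columns.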
I had the same issue, two different dataframes and without a common column. I just needed to put them beside each other in a csv file.
Merge:
In this case, merge does not work, even after adding a temporary column to both dfs and dropping it afterwards, because merge makes both dfs the same length: it repeats the rows of the shorter dataframe to match the longer dataframe's length.
Concat:
The idea of The Red Pea didn't work for me. It just appended the shorter df to the longer one (row-wise) while leaving an empty column (NaNs) above the shorter df's column.
Solution: reset the index of both dataframes before concatenating:
df1 = df1.reset_index(drop=True)  # drop=True keeps the old index from being added as a column
df2 = df2.reset_index(drop=True)
df = [df1, df2]
df_final = pd.concat(df, axis=1)
df_final.to_csv(filename, index=False)
This way, you'll see your dfs beside each other (column-wise), each with its own length.
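The reason the index matters is that `pd.concat(..., axis=1)` aligns rows by index label, not by position. The sketch below uses two made-up frames with non-overlapping labels (`left` and `right` are hypothetical names) to show the difference.

```python
import pandas as pd

# Two frames whose index labels do not overlap
left = pd.DataFrame({'x': [1, 2, 3]}, index=[10, 11, 12])
right = pd.DataFrame({'y': ['a', 'b']}, index=[20, 21])

# Aligned by label: no labels match, so the result has 3 + 2 rows
# and each frame's column is NaN opposite the other frame's rows
misaligned = pd.concat([left, right], axis=1)

# After resetting both indexes to 0..n-1 the rows line up by position
aligned = pd.concat([left.reset_index(drop=True),
                     right.reset_index(drop=True)], axis=1)
print(misaligned)
print(aligned)
```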
If somebody would like to replace a specific column with one of a different size instead of adding a new one:
Based on the answer to "Create Pandas Dataframe with different sized columns", I use a dict as an intermediate type.
If the column to be inserted is not a list but already a dict, the respective line can be omitted.
def fill_column(dataframe: pd.DataFrame, values: list, column: str):
    dict_from_list = dict(enumerate(values)) # enumerate the list and build a dict of {position: value}
    dataFrame_asDict = dataframe.to_dict() # get the DataFrame as a dict
    dataFrame_asDict[column] = dict_from_list # assign the specific column
    return pd.DataFrame.from_dict(dataFrame_asDict, orient='index').T # create a new DataFrame from the dict and return it