The first column of a file I am looking at contains a list of 7 comma-separated values, like so:
alex,43,37,12,1,2,5
There are 2000 more rows that are set up the same way.
I am only interested in the first two values; the rest are unimportant to me. I am trying to assign the first two values to separate columns in my dataframe, like so:
typesx = (line.split(",") for line in df['firstcolumn'])
ty=((type[0], type[1]) for type in typesx)
for z in range(len(df)):
    df['firstplaceholder'][z] = type(0)
    df['secondplaceholder'][z] = type(1)
However, this only places the type into the respective columns (i.e. int, str); go figure.
I am confused because if I print the two values, it prints exactly what I am looking for (in this case alex, 43), like so:
for a, b in ty:
    print(a, b)
However, I am unsure how to copy these values into the separate columns I have created (firstplaceholder, secondplaceholder).
It seems to me that you are making a mistake when retrieving the values: type(0) calls the built-in type function (hence the int you see) instead of indexing your parsed values. Inside the loop, index ty instead:
    df['firstplaceholder'][z] = ty[z][0]
    df['secondplaceholder'][z] = ty[z][1]
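Note that typesx and ty as written in the question are generator expressions, so ty[z] would raise a TypeError; a minimal sketch of the whole fix (assuming the column names from the question and a default integer index) is:

typesx = [line.split(",") for line in df['firstcolumn']]   # lists, not generators, so they can be indexed
ty = [(t[0], t[1]) for t in typesx]
for z in range(len(df)):
    df.loc[z, 'firstplaceholder'] = ty[z][0]    # .loc avoids chained-assignment warnings
    df.loc[z, 'secondplaceholder'] = ty[z][1]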
df = pd.DataFrame(data={"1": ["a,bb,c,d"], "2": ["n,g,r,h,l"]}, index=['firstcolumn']).T
df['secondcolumn'] = df['firstcolumn'].apply(lambda s: s.split(',')[1])
df['firstcolumn'] = df['firstcolumn'].apply(lambda s: s.split(',')[0])
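For the original question's column names (a sketch, assuming a pandas version where str.split supports expand=True), the same split can also be done in a vectorized way:

parts = df['firstcolumn'].str.split(',', expand=True)   # one column per comma-separated field
df['firstplaceholder'] = parts[0]
df['secondplaceholder'] = parts[1]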
I'm trying to replace every occurrence of an empty list [] in my script output with an empty cell value, but am struggling with identifying what object it is.
So the data output after running .to_excel shows [] in a number of cells (screenshot omitted).
Now the data originally exists in JSON format and I'm normalizing it with data_normalized = pd.json_normalize(data). I'm trying to filter out the empty lists occurrences right after that with filtered = data_normalized.loc[data_normalized['focuses'] == []] but that isn't working. I've also tried filtered = data_normalized.loc[data_normalized['focuses'] == '[]']
The dtype for column focuses is Object if that helps. So I'm stuck as to how to select this data.
Eventually, I want to just instead run data_normalized.replace('[]', '') but with the first parameter updated so that I can select the empty lists properly.
You could try casting the df to string type with pd.DataFrame.astype(str) and then doing the replace with the regex parameter set to False:
df.astype(str).replace('[]','',regex=False)
Example:
df=pd.DataFrame({'a':[[],1,2,3]})
df.astype(str).replace('[]','',regex=False)
   a
0
1  1
2  2
3  3
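Applied back to the question's frame (a sketch; data_normalized and the focuses column are the names from the question), the same string cast can drive both the filter and the in-place replacement:

mask = data_normalized['focuses'].astype(str) == '[]'
filtered = data_normalized.loc[mask]          # rows whose focuses is an empty list
data_normalized.loc[mask, 'focuses'] = ''     # blank those cells before writing to Excel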
I have relatively little experience with pandas, but since you cannot identify the object, try converting each list to a string and comparing it to '[]'.
For example, try using this:
filtered = data_normalized.loc[data_normalized['focuses'].astype(str) == '[]']
The data set contains a column Performance_UG with values like '90.00/100.00', '4.0/5.0', or '3.50/4.00', stored as object.
Now I have to extract the 90 and the 100, divide them, and save the result (i.e. 0.9) as a float to a new column of the data frame.
How do I do that?
I'm assuming we are talking about pandas here. Each column (a Series) has a map method. You can call it on the column and assign the result to a new one. Perhaps something like:
def parseStuff(x):
    x = x.strip()
    if x == '0':
        return 0
    a, b = [float(i) for i in x.split('/')]
    return a / b

df['Performance_UG_parsed'] = df['Performance_UG'].map(parseStuff)
If you want to create new columns or just replace the old one, iterate over the columns or assign back to the same column (for the same column you can also use apply).
Note: you may want to cast the result to something other than float, e.g. a NumPy dtype, if you are working with that.
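A more compact variant using vectorized string methods (a sketch; the same column names are assumed, and rows without a '/' end up as NaN rather than 0 here):

parts = df['Performance_UG'].str.split('/', expand=True)   # numerator in column 0, denominator in column 1
df['Performance_UG_parsed'] = pd.to_numeric(parts[0], errors='coerce') / pd.to_numeric(parts[1], errors='coerce')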
I have a pandas df with a column (let's say col3) containing numbers. Each number appears in multiple rows, and I want to run a function on the rows of each number separately.
So I wrote each number once into an array like this:
l = df.col3.unique()
Then a for loop is used to run a function for each number:
for i in l:
    a, b = func(df[df.col3 == i])
So on each run the function gets the rows where col3 contains the value of i. The function returns two data frames (a and b in this case).
I need these two returned data frames of each run.
I want to be able to identify them properly. For that I would like to save returned data frames within the loop like this:
First run: a123, b123
Second run: a456, b456
Third run: a789, b789
That means the name of the dataframe contains the current value of i.
I have already read that I should not use dynamically named global variables, but I do not know how to achieve this differently.
Thank you :)
Solution A (recommended):
dfs = {}
for i in l:
    dfs["a" + str(i)], dfs["b" + str(i)] = func(df[df.col3 == i])
...
And then you can use the dataframes like this:
func2(dfs["a1"])  # dfs["a1"] is the first frame returned by func for the rows where col3 == 1
...
Solution B (not recommended)
If you absolutely want to use local variables, you need:
for i in l:
    locals()["a" + str(i)], locals()["b" + str(i)] = func(df[df.col3 == i])
And then you can use the dataframes with their variable names a1, b1, etc. (note that writing to locals() like this is only reliable at module scope, where locals() is the same dictionary as globals()).
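Worth noting (a sketch, with func taken from the question): df.groupby('col3') already splits the frame per unique value, so the unique() array and the boolean mask can be dropped entirely:

dfs = {}
for i, sub in df.groupby('col3'):          # sub is the subframe where col3 == i
    dfs['a' + str(i)], dfs['b' + str(i)] = func(sub)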
My program contains many different NumPy arrays, with various data inside each of them. An example of an array is:
x = [[5, 'ADC01', 'Input1', 25000],   # where each row is [TypeID, Type, Input, Counts]
     [5, 'ADC01', 'Input2', 40000]]
From separate arrays I can retrieve the value of Type and Input. I then need to say
Counts = x[0,3] where Type = 'ADC01' and Input = 'Input2'
Obviously it would not be written like this. For the times that I have only needed to satisfy one condition, I have used:
InstType_ID = int(InstInv_Data[InstInv_Data[:,glo.inv_InstanceName] == Instrument_Type_L][0,glo.inv_TypeID])
Here, it looks in array(InstInv_Data) at the 'InstanceName' column and finds a match to Instrument_Type. It then assigns the 'TypeID' column to InstType_ID. I basically want to add an and statement so it also looks for another matching piece of data in another column.
Edit: I just thought that I could try and do this in two separate steps. Returning both Input and Counts columns where Type-Column = Type. However, I am unsure of how to actually return two columns, instead of a specific one. Something like this:
Intermediate_Counts = InstDef_Data[InstDef_Data[:, glo.i_Type] == Instrument_Type_L][0, [glo.i_Input, glo.i_Counts]]
You could use a & b to perform element-wise AND for two boolean arrays a, b:
selected_rows = x[(x[:,1] == 'ADC01') & (x[:,2] == 'Input2')]
Similarly, use a | b for OR and ~a for NOT.
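Tying this back to the question's layout [TypeID, Type, Input, Counts], a minimal sketch (the array is given object dtype because it mixes strings and numbers; the value 40000 comes from the example rows above):

import numpy as np

x = np.array([[5, 'ADC01', 'Input1', 25000],
              [5, 'ADC01', 'Input2', 40000]], dtype=object)

rows = x[(x[:, 1] == 'ADC01') & (x[:, 2] == 'Input2')]   # rows matching both conditions
counts = rows[0, 3]                                      # 40000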
I've been at this issue for a while to no avail. This is almost a duplicate of at least one other question on here, but I can't quite figure out how to do exactly what I'm looking for from the related answers online.
I have a Pandas DataFrame (we'll call it df) that looks something like:
Name  Value      Value2
'A'   '8.8.8.8'  'x'
'B'   '6.6.6.6'  'y'
'A'   '6.6.6.6'  'x'
'A'   '8.8.8.8'  'x'
Where Name is the index. I want to convert this to something that looks like:
Name  Value                   Value2
'A'   ['8.8.8.8', '6.6.6.6']  'x'
'B'   ['6.6.6.6']             'y'
So, basically, every Value that corresponds to the same index should be combined into a list (or a set, or a tuple) and that list made to be the Value for the corresponding index. And, as shown, Value2 is the same between like-indexed rows, so it should just stay the same in the end.
All I've done (successfully) is figure out how to make each element in the Value column into a list with:
df['Value'] = pd.Series([[val] for val in df['Value']])
In the question I linked at the start of this post, the recommended way to combine columns with duplicate indices offers a solution using df.groupby(df.index).sum(). I know I need something besides df.index as an argument to groupby since the Value column is treated as special, and I'm not sure what to put in place of sum() since that's not quite what I'm looking for.
Hopefully it's clear what I'm looking for, let me know if there's anything I can elaborate on. I've also tried simply looping through the DataFrame myself, finding rows with the same index, combining the Values into a list and updating df accordingly. After trying to get this method to work for a bit I thought I'd look for a more Pandas-esque way of handling this problem.
Edit: As a follow up to dermen's answer, that solution kind of worked. The Values did seem to concatenate correctly into a list. One thing I realized was that the unique function returns a Series, as opposed to a DataFrame. Also, I do have more columns in the actual setup than just Name, Value, and Value2. But I think I was able to get around both of the issues successfully with the following:
gb = df.groupby(tuple(df.columns.difference(['Value'])))
result = pd.DataFrame(gb['Value'].unique(), columns=df.columns)
Where the first line gives an argument to groupby of the list of columns minus the Value column, and the second line converts the Series returned by unique into a DataFrame with the same columns as df.
But I think with all of that in place (unless anyone sees an issue with this), almost everything works as intended. There does seem to be something that's a bit off here, though. When I try to output this to a file with to_csv, there are duplicate headers across the top (but only certain headers are duplicated, and there's no real pattern as to which, as far as I can tell). Also, the Value lists are truncated, which is probably a simpler issue to fix. The csv output currently looks like:
Name Value Value2 Name Value2
'A' ['8.8.8.8' '7.7.7.7' 'x'
'B' ['6.6.6.6'] 'y'
The above looks weird, but that is exactly how it looks in the output. Note that, contrary to the example presented at the start of this post, there are assumed to be more than 2 Values for A (so that I can illustrate this point). When I do this with the actual data, the Value lists get cut off after the first 4 elements.
I think you are looking to use pandas.Series.unique. First, make the 'Name' index a column:
df
# Value2 Value
#Name
#A x 8.8
#B y 6.6
#A x 6.6
#A x 8.8
df.reset_index(inplace=True)
# Name Value2 Value
#0 A x 8.8
#1 B y 6.6
#2 A x 6.6
#3 A x 8.8
Next, call groupby and then call the unique function on the 'Value' series:
gb = df.groupby(['Name','Value2'])
result = gb['Value'].unique()
result.reset_index(inplace=True) #lastly, reset the index
# Name Value2 Value
#0 A x [8.8, 6.6]
#1 B y [6.6]
Finally, if you want 'Name' as the index again, just do
result.set_index( 'Name', inplace=True)
# Value2 Value
#Name
#A x [8.8, 6.6]
#B y [6.6]
UPDATE
As a follow-up, make sure you re-assign result after resetting the index (reset_index on a Series returns a new DataFrame rather than modifying it in place):
result = gb['Value'].unique()
type(result)
#pandas.core.series.Series
result = result.reset_index()
type(result)
#pandas.core.frame.DataFrame
Saving as CSV (rather, TSV)
You don't want to use CSV here because there are commas in the Value column entries. Instead, save as TSV; you still use the same to_csv method, just change the sep argument:
result.to_csv( 'result.txt', sep='\t')
If I load result.txt into Excel as a TSV, the columns come through as expected (screenshot omitted).
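For completeness, reading the file back with pandas would look something like this (a sketch; note that the Value lists come back as plain strings, not Python lists):

result = pd.read_csv('result.txt', sep='\t', index_col=0)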