Pandas read_csv gives decimal column numbers - python

I've been pulling my hair out trying to make a bipartite graph from a CSV file, and so far all I have is a pandas matrix.
My code so far is just:
```
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# import pyexcel as pe
# import pyexcel.ext.xlsx
from networkx.algorithms import bipartite
mat = pd.read_csv("networkdata3.csv")
# mat = pd.read_excel("networkdata1.xlsx", sheet_name="sheet_name_1")
print(mat.info())  # info() is a method; without parentheses it isn't called
sand = nx.from_pandas_adjacency(mat)
```
and I have no clue what I'm doing wrong. Initially I tried to read it in as the original .xlsx file, but then I just converted it to a CSV and it started reading. I assume I can't make the graph because the column numbers come out as decimals, and the error it spits out claims the column numbers don't match up. So how else should I be doing this to actually start making some progress?
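For what it's worth, a guess at a fix. The `.1`-style decimal suffixes usually mean pandas deduplicated repeated column names, and `from_pandas_adjacency` additionally requires a square frame whose index equals its columns. A minimal sketch, assuming the CSV's first column holds the same node labels as its header row:

```
import networkx as nx
import pandas as pd
from networkx.algorithms import bipartite
from scipy.sparse import csr_matrix

# treat the first CSV column as row labels so index and columns line up
mat = pd.read_csv("networkdata3.csv", index_col=0)

# square matrix, same node set on rows and columns:
sand = nx.from_pandas_adjacency(mat)

# rectangular matrix, rows and columns as two different node sets --
# the usual shape for a bipartite graph:
sand = bipartite.from_biadjacency_matrix(csr_matrix(mat.values))
```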

Related

How to create a dataframe of one index of a dataset?

I have a dataset NearGrid with dimensions (index:225, time:25933) that contains daily temperature data for 225 locations.
How can I create a dataframe for the first location (index=0) where the columns are date and tmax and each row represents one day of data (i.e. 25933 rows x 2 columns)?
Here's what I'm trying:
#import libraries
import os
import matplotlib.pyplot as plt
from netCDF4 import Dataset as netcdf_dataset
import numpy as np
from cartopy import config
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import xarray as xr
import pandas as pd
#open netcdf files
df=xr.open_mfdataset('/glacier1/mmartin/data/NOAATmax/Tmax****.nc')
#open cm stations csv and create new dataset with only points closest to stations in CMStations
CMStations=pd.read_csv('Slope95.csv')
Lat=CMStations.lat
Lon=CMStations.lon
NearGrid=df.sel(lat=Lat.to_xarray(), lon=Lon.to_xarray(), method='nearest')
#create dataframe of first location in NearGrid
NearGrid.isel(index=0).to_dataframe()
but when I do this the code runs indefinitely and nothing happens.
The problem was the way the data was chunked. When I saved the subsetted data as a new netcdf file and then opened it in a new notebook, it worked. I did that through this:
#import libraries
import os
import matplotlib.pyplot as plt
from netCDF4 import Dataset as netcdf_dataset
import numpy as np
from cartopy import config
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import xarray as xr
import pandas as pd
#open netcdf files
df=xr.open_mfdataset('/glacier1/mmartin/data/NOAATmax/Tmax****.nc')
#open cm stations csv and create new dataset with only points closest to stations in CMStations
CMStations=pd.read_csv('Slope95.csv')
Lat=CMStations.lat
Lon=CMStations.lon
NearGrid=df.sel(lat=Lat.to_xarray(), lon=Lon.to_xarray(), method='nearest')
#save as new netcdf file
NearGrid.to_netcdf('/glacier1/mmartin/data/NOAATmax/Tmax_CMStations_19510101-20211231.nc')
I then opened this file in a new notebook and manipulated the data there.
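For completeness, a minimal sketch of the last step against the saved file (assuming the temperature variable is named tmax):

```
import xarray as xr

# the subsetted file is small, so no dask chunking is involved
NearGrid = xr.open_dataset('/glacier1/mmartin/data/NOAATmax/Tmax_CMStations_19510101-20211231.nc')

# first location as a 25933 x 2 frame with 'date' and 'tmax' columns
df0 = NearGrid.isel(index=0).to_dataframe().reset_index()
df0 = df0[['time', 'tmax']].rename(columns={'time': 'date'})
```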

Text file writing: next line starting from last value

I have a list of networkx graphs, and I am trying to write a text file containing a massive edge list of all graphs. If you run the following code:
from torch_geometric.datasets import TUDataset
dataset = TUDataset(root='data/TUDataset', name='MUTAG')
Then go to data->TUDataset->MUTAG->raw; I am trying to replicate those raw files but using my own data.
My raw data is a MATLAB .mat file containing a struct whose first column A holds each individual graph's adjacency matrix, from which I create the networkx graphs:
from scipy.io import loadmat
import pandas as pd
import networkx as nx

raw_data = loadmat('data_final3.mat', squeeze_me=True)
data = pd.DataFrame(raw_data['Graphs'])
A = data.pop('A')
nx_graph = []
for i in range(len(A)):
    nx_graph.append(nx.Graph(A[i]))
I created the MUTAG_graph_indicator file using:
with open('graph_indicator.txt', 'w') as f:
    for i in range(len(nx_graph)):
        f.write((str(i) + '\n') * len(nx_graph[i].nodes))
If there is a way to do this in either Python or MATLAB, I would greatly appreciate the help. Yes, torch_geometric does have from_networkx, but it doesn't seem to contain the same information as when the torch_geometric graphs are created the same way as the sample data.
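In the same spirit, a sketch of the matching global edge list (the A.txt file), assuming the nx_graph list built above. Note that the TUDataset raw files number nodes and graphs starting from 1, so the graph_indicator loop above would write i + 1 under the same scheme:

```
# write one global edge list where node ids keep counting up across
# graphs, as in MUTAG_A.txt (1-based, each undirected edge stored in
# both directions)
offset = 1
with open('A.txt', 'w') as f:
    for g in nx_graph:
        relabel = {n: i + offset for i, n in enumerate(g.nodes)}
        for u, v in g.edges:
            f.write(f"{relabel[u]}, {relabel[v]}\n")
            f.write(f"{relabel[v]}, {relabel[u]}\n")
        offset += g.number_of_nodes()
```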

CSV data structure printing in messy format

I've exported an Excel file to a CSV where all the columns and entries look correct and normal. However, when I put it into a data frame and print the head, the structure becomes very messy and unreadable because the columns are unstructured.
As you can see in the image, the values are not neatly under user_id.
https://imgur.com/a/gbWaTwi
I'm using the following code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
then
df1 = pd.read_csv('../doc.csv', low_memory=False)
df1.head
Print the invocation of head(); just saying .head isn't enough, since without the parentheses you get the method object instead of calling it:
print(df1.head())
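If the columns are simply wrapping because the frame is wide (a guess based on the screenshot), widening pandas' display options may also help:

```
pd.set_option('display.max_columns', None)  # show every column
pd.set_option('display.width', 200)         # don't wrap at the default width
print(df1.head())
```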

Using Dask with Python causes issues when running Pandas code

I am trying to work with Dask because my dataframe has become so large that pandas by itself can't process it. I read my dataset in as follows and get a result that looks odd; I'm not sure why it's not outputting the dataframe:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
import dask.bag as db
import json
%matplotlib inline
Leads = db.read_text('Leads 6.4.18.txt')
Leads
This returns (instead of my pandas dataframe):
dask.bag<bag-fro..., npartitions=1>
Then when I try to rename a few columns:
Leads_updated = Leads.rename(columns={'Business Type': 'Business_Type', 'Lender Type': 'Lender_Type'})
Leads_updated
I get:
AttributeError: 'Bag' object has no attribute 'rename'
Can someone please explain what I am not doing correctly? The objective is to use Dask for all of these steps, since the data is too big for regular Python/pandas. My understanding is that the syntax used under Dask should be the same as pandas.
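The pandas-like API lives in dask.dataframe, not dask.bag: db.read_text returns a Bag of raw text lines, which is why it has no rename. A sketch of the dask.dataframe route (assuming the file is tab-delimited; adjust sep to match):

```
import dask.dataframe as dd

Leads = dd.read_csv('Leads 6.4.18.txt', sep='\t')
Leads_updated = Leads.rename(columns={'Business Type': 'Business_Type',
                                      'Lender Type': 'Lender_Type'})
print(Leads_updated.head())  # head() eagerly computes a small sample
```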

python script converting .dat to json

I have a .dat file that I want to use in my script, which draws a scatter graph from the data in that file. I have been manually converting the .dat files to .csv for this purpose, but I find that unsatisfactory.
This is what I am using currently.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
filename = raw_input('Enter filename ')  # raw_input is Python 2; use input() on Python 3
csv = pd.read_csv(filename)
data=csv[['deformation','stress']]
data=data.astype(float)
x=data['deformation']
y=data['stress']
plt.scatter(x,y,s=0.5)
fit=np.polyfit(x,y,15)
p=np.poly1d(fit)
plt.plot(x,p(x),"r--")
plt.show()
A programmer friend told me that it would be more convenient to convert it to JSON and use it as such. How would I go about this?
Try using numpy's file-reading features:
import numpy as np
yourArray = np.fromfile('YourData.dat', dtype=dtype)  # dtype must describe the file's binary layout
yourArray = np.loadtxt('YourData.dat')
loadtxt is more flexible than fromfile
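If JSON is the goal, one possible route, assuming the .dat file is whitespace-delimited with a header row naming deformation and stress:

```
import pandas as pd

# read_csv handles the .dat directly once given the right delimiter
data = pd.read_csv('YourData.dat', sep=r'\s+')

# write it out as JSON once...
data.to_json('YourData.json', orient='records')

# ...and load the JSON in the plotting script instead of the CSV
data = pd.read_json('YourData.json')
x, y = data['deformation'], data['stress']
```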
