Are you genuinely interested in discovering insights and using data to solve problems, or are you simply drawn to what LinkedIn calls “the most promising careers” and Glassdoor calls “the best jobs in America”? Either way, chances are you are familiar with data science. But what about graph data science?
As already mentioned, graphs are universal data structures that appear in many forms, from analytics to databases, and from knowledge management to data science, machine learning and even hardware.
Graph data science aims to answer questions not only about data, but also about the connections between data points. That is the 30-second explanation, according to Alicia Frame.
Frame is Senior Director of Product Management for Data Science at Neo4j, a leading provider of graph databases. She has a PhD in Computational Biology and has been a practicing data scientist working with connected data for 10 years.
When she joined Neo4j about three years ago, she set out to build a best-in-class solution for data scientists working with connected data. Today, the product Frame leads at Neo4j, aptly named Graph Data Science, celebrates its second anniversary with version 2.0, which brings several important advances.
New features include a native Python client and a managed service called AuraDS, available on Google Cloud.
We interviewed Frame to discuss graph data science the concept and Graph Data Science the product.
Concept: graph data science
The point of graph data science is to use the relationships in the data. Most data scientists work with tabular data. However, graphs are very important for gaining better insight, for answering questions that can only be answered using connections, or simply for representing data more faithfully.
As Frame explained, this can mean using graph queries to find patterns known to exist, or using unsupervised methods such as graph algorithms to sift through the data and find patterns. It can also mean using supervised machine learning for classification and prediction: What kind of graph is this? Where will a relationship form in the future?
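To make the distinction concrete, here is a minimal sketch in plain Python, not Neo4j code. It assumes a hypothetical toy graph of people; the names, the edge list and the helper functions are all invented for illustration. The first function looks for a pattern we already know we want (a triangle of mutually connected nodes), the way a graph query would. The second lets an unsupervised algorithm (connected components) discover groups without being told what to look for.

```python
from collections import defaultdict

# Hypothetical toy data: undirected relationships between people.
EDGES = [("ann", "bob"), ("bob", "cat"), ("cat", "ann"),
         ("cat", "dan"), ("eve", "fay")]


def build_adjacency(edges):
    """Turn an edge list into a node -> set-of-neighbors mapping."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def find_triangles(adj):
    """Query-style search for a known pattern: three mutually linked nodes."""
    triangles = set()
    for a in adj:
        for b in adj[a]:
            # Any node adjacent to both a and b closes a triangle.
            for c in adj[a] & adj[b]:
                triangles.add(frozenset((a, b, c)))
    return triangles


def connected_components(adj):
    """Unsupervised grouping: nodes reachable from one another."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        components.append(component)
    return components


adjacency = build_adjacency(EDGES)
triangles = find_triangles(adjacency)
components = connected_components(adjacency)
```

On this toy data the query finds the single ann, bob, cat triangle, while the algorithm splits the graph into two groups that nobody specified in advance.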
Product: Graph Data Science
The Graph Data Science (GDS) product is a relatively new addition to the Neo4j ecosystem and serves a dual purpose. On the one hand, it aims to appeal to business analysts and data analysts, as well as data scientists, who are not necessarily graph database users.
A key value proposition that GDS brings to them is not only the ability to store connected data in a connected format, but also a single workspace and environment in which they can do everything from data analysis to persisting queries to training and developing models. No ETL is required, as the data is already stored as a graph in Neo4j. But GDS also aims to cater to the more traditional Neo4j audience: developers.
Frame mentioned how Meredith Corporation used Neo4j to build user journeys. As a follow-up to this use case, the company used GDS to identify anonymous readers across its websites.
This use case came from longtime Neo4j developers who enjoyed the product. That led them to look for more value in it, and ultimately to find a way to solve the problem using GDS. “They said, ‘Wait a minute, this [graph] algorithm solves this really complex application problem that we have, so it fits perfectly into our pipeline,’” Frame said.
GDS data scientist interface
Making GDS easy to use for all potential users is a top priority for this release, and that includes offering GDS as a managed cloud service. Neo4j already has a managed cloud product called Aura, available on all major cloud platforms. After months of preview, GDS is now available on Google Cloud under the name AuraDS.
As Frame explained, AuraDS was built from the ground up to provide a tailored experience for data scientists. It is based on the Aura platform, but it is configured differently and optimized for a different kind of workload. This touches upon many aspects.
On the technical front, data science workloads are typically much more memory-intensive and use more threads than database workloads. The team wanted to make sure they had the right configuration for data scientists to be successful. But most of their time and effort went into building a user interface that works for data scientists, she added.
The needs and skills of data scientists are different from those of developers: they are interested in getting value from their data, finding new insights, and building more predictive models, not in setting up or maintaining a database. AuraDS has a completely rebuilt user interface that makes the experience friendlier for data scientists, Frame said.
She offered the example of helping users with sizing guidance: getting estimates of the numbers of nodes and edges in the graphs they want to work with, as well as the algorithms they want to run, and providing recommendations for the resources they will need. A set of metrics relevant to data scientists, such as CPU usage and memory usage, has also been added.
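The idea behind this kind of sizing guidance can be sketched in a few lines of Python. This is an illustration only, not how AuraDS computes its recommendations: the function names, the per-node and per-relationship byte costs and the headroom factor are all hypothetical inputs the caller must supply.

```python
GIB = 2 ** 30  # bytes in one gibibyte


def estimate_memory_bytes(node_count, relationship_count,
                          bytes_per_node, bytes_per_relationship,
                          algorithm_bytes_per_node=0):
    """Rough in-memory footprint of a graph plus one algorithm run.

    All per-item costs are caller-supplied assumptions, not Neo4j figures.
    """
    node_bytes = node_count * (bytes_per_node + algorithm_bytes_per_node)
    relationship_bytes = relationship_count * bytes_per_relationship
    return node_bytes + relationship_bytes


def recommend_instance_gib(required_bytes, available_sizes_gib, headroom=1.5):
    """Pick the smallest instance size that fits the estimate with headroom.

    Returns None when no offered size is large enough.
    """
    needed = required_bytes * headroom
    for size in sorted(available_sizes_gib):
        if size * GIB >= needed:
            return size
    return None


# Hypothetical workload: 1M nodes, 10M relationships.
estimate = estimate_memory_bytes(1_000_000, 10_000_000,
                                 bytes_per_node=16,
                                 bytes_per_relationship=8,
                                 algorithm_bytes_per_node=8)
recommended = recommend_instance_gib(estimate, [2, 4, 8])
```

The point of the design is that the user only supplies counts and a choice of algorithm, and the service translates those into a resource recommendation.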
Meeting data scientists where they are
Another big improvement is the native Python client. First, data scientists no longer need to use Cypher, Neo4j’s query language, and can work directly in Python, the most popular language among data scientists. Second, they can work with both AuraDS and GDS directly from a notebook and get results as dataframes instead of going through the Neo4j UI. Users can choose what works best for them. This illustrates a broader point about AuraDS: features developed for it are generally available in GDS as well.
Another example of this is persistence and backups. AuraDS handles these for the user, but they are now also available in self-managed GDS. As Frame admits, working in memory is a double-edged sword. It allows for fast processing of large graphs, but it also raises some concerns.
First, if the results of processing need to be saved, it is the user’s responsibility to handle that. Second, if a failure occurs before processing completes, the work is lost and has to be started over. Because running graph algorithms in memory is fast, and there are safeguards in place to keep the database from tipping over, this was not a big problem. Still, being able to persist intermediate state helps.
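A notebook workflow with the Python client might look roughly like the sketch below. It assumes the `graphdatascience` package is installed and a Neo4j instance with GDS is reachable. The connection details, the `Person` label, the `KNOWS` relationship type, the `pagerank` property name and both helper functions are hypothetical. The sketch also shows the persistence point made above: results live in memory until the user explicitly writes them back.

```python
def connect(uri, user, password, aura_ds=False):
    """Open a client session (needs `pip install graphdatascience`).

    Pass aura_ds=True when the target is an AuraDS instance.
    """
    from graphdatascience import GraphDataScience  # imported lazily
    return GraphDataScience(uri, auth=(user, password), aura_ds=aura_ds)


def rank_and_persist(gds, graph_name="people"):
    """Project a graph, run PageRank, and write the scores back.

    "Person", "KNOWS" and "pagerank" are assumed names for illustration.
    """
    # Project nodes and relationships into the in-memory graph catalog.
    G, _ = gds.graph.project(graph_name, "Person", "KNOWS")
    try:
        # Stream mode hands the scores back to Python as a dataframe.
        scores = gds.pageRank.stream(G)
        # Write mode persists the scores as a node property in the database.
        gds.pageRank.write(G, writeProperty="pagerank")
        return scores
    finally:
        # Free the in-memory projection once the work is done.
        G.drop()
```

A call such as `rank_and_persist(connect("neo4j://localhost:7687", "neo4j", "secret"))` would then return the scores without any Cypher being written by hand; the URI and credentials here are placeholders.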
Compatibility and sync
There are other operational improvements as well. GDS is now more compatible with transactional clusters. This means you don’t have to worry about copying data from your cluster to a single instance or back to your cluster from your dedicated Data Science instance.
“You don’t have to worry about that anymore, you don’t get anything that isn’t configured for either workload,” Frame said. Users can now add a dedicated GDS node to the cluster, and updated data is picked up automatically in real time. Data science workloads can run without impacting transactional workloads, and synchronization is done under the hood, so there is no ETL to worry about. Frame highlighted this improvement, noting that customers picked it up and ran with it even before release. Users can also now pause their instances to reduce costs without losing results.
Integrate and improve
GDS 2.0 also offers more machine learning and AutoML capabilities. It introduces the ability to create ML pipelines for tasks such as link prediction, which means adding missing relationships to a graph, and node classification, which means filling in missing labels, for example marking transactions as fraudulent or not.
Frame explained how GDS introduces the concept of a pipeline catalog. This allows users to state that they want to train a model toward a specific end goal, while GDS assists with intermediate steps such as generating embeddings and selecting the best-performing model.
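For readers new to link prediction, the task itself can be illustrated without any machine learning at all. The sketch below is not the GDS pipeline, which trains a supervised model on embeddings; it swaps in the classic common-neighbors heuristic, and the toy graph and function names are hypothetical. Each pair of nodes that is not yet connected is scored by how many neighbors the two nodes share, and the highest-scoring pairs are proposed as likely missing relationships.

```python
from itertools import combinations

# Hypothetical undirected graph as node -> set of neighbors.
GRAPH = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
}


def common_neighbor_scores(adj):
    """Score every unconnected node pair by its number of shared neighbors."""
    scores = {}
    for a, b in combinations(sorted(adj), 2):
        if b in adj[a]:
            continue  # already linked, nothing to predict
        shared = len(adj[a] & adj[b])
        if shared:
            scores[(a, b)] = shared
    return scores


def predict_links(adj, top_k=3):
    """Return the top_k candidate pairs, best score first, ties by name."""
    scores = common_neighbor_scores(adj)
    return sorted(scores, key=lambda pair: (-scores[pair], pair))[:top_k]


candidate_scores = common_neighbor_scores(GRAPH)
top_candidates = predict_links(GRAPH, top_k=2)
```

A trained pipeline replaces the hand-written score with a model learned from node features, but the input (a graph) and the output (ranked candidate relationships) have the same shape.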
This also relates to the broader story of integrations, especially with Google and its Vertex AI platform. Neo4j and Google are partners, which is why AuraDS was brought to Google Cloud first. Additionally, AuraDS and Vertex AI can be integrated, and Neo4j and Google have been working together on this and continue to do so.
The new integrations are a significant addition to GDS and AuraDS. As Frame points out, data scientists don’t work in isolation. Therefore, it is important to support moving data into and out of GDS. GDS 2.0 works with the Neo4j connectors for Apache Spark and for BI tools such as Microsoft Power BI, Tableau and Looker. Additionally, integrations with Dataiku and KNIME have been added.
Last but not least, GDS 2.0 includes new algorithms and improvements to existing ones. According to Neo4j, algorithms such as breadth-first search, depth-first search, K-nearest neighbors and delta-stepping have now been promoted to the production tier.
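As a refresher on the first two of those algorithms, here is a minimal plain-Python sketch of breadth-first and depth-first traversal. It is not the GDS implementation; the toy graph and function names are hypothetical, and neighbors are visited in alphabetical order only to make the traversal order deterministic.

```python
from collections import deque

# Hypothetical directed graph as node -> list of successors.
GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}


def bfs_order(adj, source):
    """Visit nodes level by level, nearest to the source first."""
    visited, seen = [source], {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(adj.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                visited.append(neighbor)
                queue.append(neighbor)
    return visited


def dfs_order(adj, source):
    """Follow one branch as deep as possible before backtracking."""
    visited, seen, stack = [], set(), [source]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        visited.append(node)
        # Push in reverse so the alphabetically first neighbor is popped first.
        stack.extend(sorted(adj.get(node, ()), reverse=True))
    return visited


breadth_first = bfs_order(GRAPH, "a")
depth_first = dfs_order(GRAPH, "a")
```

The only structural difference between the two is the container: a queue yields breadth-first order, a stack yields depth-first order.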
Big picture
Overall, GDS has undergone a major upgrade and overhaul. The launch of AuraDS brings the benefits of the cloud to GDS. GDS has seen over 370% year-over-year growth in enterprise customers and hundreds of thousands of downloads. GDS 2.0 and AuraDS bring graph data science one step closer to mainstream adoption.
