Using Machine Learning to Develop Patent Search Models

In the world of patents, navigating complex and often non-standard language can be daunting. Many patent owners use unconventional terms to describe their inventions, which makes relevant patents hard to find. Google's BigQuery patent data, and the curated datasets derived from it, offer a way forward.

Google has made its vast patent data accessible to the public through the "patents-public-data" project on Google BigQuery. It aggregates global patent information, including titles, abstracts, classifications, and inventor details. The BigQuery data is not itself a prepackaged set of phrases for training patent search models, but it is a rich source from which such phrases can be extracted.

For those seeking a more focused dataset, Kaggle hosts curated datasets like the "CleanTech - Google Patent Dataset." This dataset, derived from Google Patents data, is particularly useful for those interested in renewable energy and sustainable technologies. It offers JSON files with patents filtered by keywords such as "solar energy," "photovoltaics," and "wind energy."

If you wish to create a customised phrase dataset for training, you can query the "patents-public-data.patents.publications" table on Google BigQuery. Write SQL queries that extract phrases or text segments from titles, abstracts, and descriptions according to your selection criteria, then export the results in a format suited to machine learning training, such as CSV or JSON.
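The query-and-export step could look roughly like the sketch below. It assumes the `google-cloud-bigquery` Python client is installed and that default credentials are configured. The column names (`publication_number`, `title_localized`, `abstract_localized`, `country_code`) are assumptions based on the public table's schema and should be checked against the current schema before use. The helper names `build_query`, `fetch_rows`, `write_jsonl`, and `write_csv` are made up for illustration.

```python
import csv
import json

PUBLICATIONS_TABLE = "patents-public-data.patents.publications"


def build_query(limit: int = 1000) -> str:
    """Build SQL that selects English titles and abstracts for one country.

    The country is passed as the named query parameter @country.
    """
    return f"""
    SELECT
      publication_number,
      (SELECT t.text FROM UNNEST(title_localized) AS t
       WHERE t.language = 'en' LIMIT 1) AS title,
      (SELECT a.text FROM UNNEST(abstract_localized) AS a
       WHERE a.language = 'en' LIMIT 1) AS abstract
    FROM `{PUBLICATIONS_TABLE}`
    WHERE country_code = @country
    LIMIT {int(limit)}
    """


def fetch_rows(sql: str, country_code: str = "US") -> list:
    """Run the query and return each result row as a plain dict."""
    # Imported lazily so the export helpers work without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumes default credentials are configured
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("country", "STRING", country_code)
        ]
    )
    result = client.query(sql, job_config=job_config).result()
    return [dict(row.items()) for row in result]


def write_jsonl(rows: list, path: str) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def write_csv(rows: list, path: str,
              fieldnames=("publication_number", "title", "abstract")) -> None:
    """Write the rows as CSV with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

A typical run would be `rows = fetch_rows(build_query(limit=500))` followed by `write_jsonl(rows, "patents.jsonl")`. Keeping the `LIMIT` small while experimenting is sensible, since BigQuery bills by the amount of data scanned.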

Google has also created its own dataset of phrases for training patent search models, distinct from the raw BigQuery tables. For large-scale model training, manually downloading PDFs of individual patents from Google Patents is impractical. Specialised patent search tools such as PatentLens and the USPTO's Global Patent Search Network can complement this work, but they do not contain Google's phrase dataset.

In essence, the best approach is to utilise Google's BigQuery patent data, or trusted derivatives of it, to build or acquire a phrase dataset for training patent search models. Google's own phrase dataset comprises approximately 50,000 phrase-to-phrase pairs, each labelled to indicate how the two phrases are related. It targets the non-standard language patents are known for, such as describing a soccer ball as a "spherical recreation device."
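To see why labelled phrase pairs matter, consider a purely lexical baseline. The sketch below is illustrative only: the `PhrasePair` structure, its field names, and the label values are hypothetical and do not reflect the actual schema of Google's dataset. The first pair is the article's own example and the second is invented for contrast. Word-overlap (Jaccard) similarity scores the soccer-ball pair at zero because the two phrases share no words, which is the gap a model trained on labelled pairs is meant to close.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhrasePair:
    """One training example: two phrases plus a relationship label.

    Field names and label values are hypothetical, chosen for illustration.
    """
    anchor: str
    target: str
    label: str


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity: shared words / all distinct words."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


pairs = [
    # Example from the article: same concept, no shared vocabulary.
    PhrasePair("soccer ball", "spherical recreation device", "related"),
    # Invented contrast case: shared vocabulary, so overlap scores it highly.
    PhrasePair("soccer ball", "soccer ball pump", "related"),
]

for pair in pairs:
    score = jaccard(pair.anchor, pair.target)
    print(f"{pair.anchor!r} vs {pair.target!r}: "
          f"overlap={score:.2f}, label={pair.label}")
```

The first pair gets an overlap of 0.00 despite describing the same object, so a keyword-style search would miss it entirely.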

Such a dataset helps patent owners and searchers navigate patent descriptions more effectively. Models trained on it can learn to match unconventional wording to the concept it describes, so searches return more relevant and focused results.
