Projects

High-level API for interacting with a Gretel project.

class gretel_client.projects.Project(*, name: str, client: Client, project_id: str, desc: str = None)

Representation of a single Gretel project. In general you should not need to instantiate this class directly; use the factory method on a Client instance instead.

Using the factory method:

from gretel_client import get_cloud_client

client = get_cloud_client('api', 'your_api_key')
project = client.get_project(create=True)  # creates a project with an auto-named slug

delete()

Deletes this project. After this is called, this object can be discarded or deleted itself.

Note

If you attempt to use other methods on this project instance after deletion, you will receive API errors.

description = None

A short description of the project

property entities

Return all entities that have been observed in this project

property field_count

Return the total number of fields (as an int) in the project.

flush()

This will flush all project data from the Gretel metastore.

This includes all Field, Entity, and cached Record information

Note

This command runs asynchronously. When it returns, it has only triggered the flush operation in Gretel. The full operation may take several seconds to complete.

get_field_details(*, entity: str = None, count=500) → List[dict]

Return details for all fields in the project.

Parameters
  • entity – if an entity label is supplied, then only return fields that contain that entity

  • count – how many fields to retrieve

Returns

A list of dictionaries that match the Fields API schema from the Gretel REST API

get_field_entities(*, as_df=False, entity: str = None) → Union[List[dict], pandas.core.frame.DataFrame]

Download all fields from the Metastore and create flat rows of all field + entity relationships.

Normally, the list of all entities for a given field is stored in an array attached to the field. Here we de-normalize this and create a single record for each field and entity combination.

So if a field called “foo” has 3 entities embedded inside its metadata, we’ll create 3 new rows out of this field metadata. We can then easily return this as a DataFrame.

Parameters
  • as_df – Return this dataset as a Pandas DataFrame

  • entity – Filter on a specific entity. If None, all fields are used

Returns

A Pandas DataFrame or a list of dicts
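
The de-normalization described above can be sketched in plain Python. The field layout used here (a "field" name plus an "entities" list of dicts with a "label" key) is an assumption made for illustration and may not match the actual Fields API schema; flatten_field_entities is a hypothetical helper, not part of gretel_client.

```python
from typing import List, Optional


def flatten_field_entities(fields: List[dict], entity: Optional[str] = None) -> List[dict]:
    """Create one flat row per (field, entity) pair.

    Assumed input shape (hypothetical): each field dict has a "field" name
    and an "entities" list whose items carry a "label" key.
    """
    rows = []
    for field in fields:
        for ent in field.get("entities", []):
            # Optional filter: keep only rows for the requested entity label.
            if entity is not None and ent.get("label") != entity:
                continue
            row = {"field": field["field"]}
            # Prefix entity keys so they cannot collide with field-level keys.
            row.update({f"entity_{key}": value for key, value in ent.items()})
            rows.append(row)
    return rows


fields = [
    {"field": "foo", "entities": [
        {"label": "email_address", "count": 4},
        {"label": "person_name", "count": 2},
        {"label": "phone_number", "count": 1},
    ]},
    {"field": "bar", "entities": [{"label": "email_address", "count": 7}]},
]

rows = flatten_field_entities(fields)
email_rows = flatten_field_entities(fields, entity="email_address")
```

In this sample, "foo" contributes three rows and "bar" contributes one. A list of flat dicts like this can be passed to pandas.DataFrame(rows) to obtain a DataFrame.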

head(n: int = 5) → pandas.core.frame.DataFrame

Get the top N records, flattened, and return them as a DataFrame. This mimics the DataFrame.head() method

Parameters

n – the number of records to retrieve

Returns a Pandas DataFrame

iter_records(**kwargs)

Iterate forwards (optionally waiting) or backwards in the record stream.

Parameters
  • position – Record ID that determines stream starting point.

  • post_process – A function to apply against incoming records. This is useful for applying record transformations.

  • direction – Determine what direction in time to move across a stream. Valid options include forward or backward.

  • record_limit – The number of records to iterate before terminating the iterator. If record_limit is less than zero, the iterator will continue forward in time indefinitely, or backwards until the first record in the stream is reached. If the iterator is moving forward in time and there are no new records on the stream, the function will block until more records become available.

  • wait_for – Time in seconds to wait for new records to arrive before closing the iterator. If the number is set to a value less than 0, the iterator will wait indefinitely.

Yields

An individual record object. If no record_limit is passed and the iterator is moving forward in time, the function will loop indefinitely waiting for new records. During this time, the function will block until new records become available. If the iterator is moving backwards (or historically) through a stream, the iterator will continue until the record_limit is reached, or until the first record in the stream is found.
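
As an illustration of the parameters above, the sketch below reads a bounded number of historical records and applies a post_process function. Because a real project needs API credentials, it runs against FakeProject, a hypothetical in-memory stand-in that only imitates the direction, record_limit, and post_process keywords. The record shape and the stand-in's behaviour are assumptions, not the library's implementation.

```python
from typing import Callable, Iterator, List, Optional


class FakeProject:
    """Hypothetical in-memory stand-in for a Project (illustration only)."""

    def __init__(self, records: List[dict]):
        self._records = records  # stored oldest first

    def iter_records(self, *, direction: str = "forward", record_limit: int = -1,
                     post_process: Optional[Callable[[dict], dict]] = None) -> Iterator[dict]:
        if direction not in ("forward", "backward"):
            raise ValueError("direction must be 'forward' or 'backward'")
        ordered = self._records if direction == "forward" else list(reversed(self._records))
        for seen, record in enumerate(ordered):
            # A negative record_limit means "no limit" in this stand-in.
            if 0 <= record_limit <= seen:
                return
            yield post_process(record) if post_process else record
        # Unlike a live stream, this stand-in never blocks waiting for new records.


def redact_email(record: dict) -> dict:
    """Example post_process function: mask a hypothetical "email" key."""
    return {**record, "email": "<redacted>"}


project = FakeProject([{"id": i, "email": f"user{i}@example.com"} for i in range(5)])

# Walk backwards from the newest record and stop after two records.
latest = list(project.iter_records(direction="backward", record_limit=2,
                                   post_process=redact_email))
```

With a real project, passing a non-negative record_limit (or moving backward) keeps the loop from waiting indefinitely for new records.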

name = None

The unique name of the project. This is either set by you or auto-managed by Gretel.

project_id = None

The unique Project ID for your project. This is auto-managed by Gretel

property record_count

Return the total number of records that have been ingested (as an int) for the project.

sample(n=10) → List[dict]

Return the top N records. These records will be in the raw format that they were received and will have all Gretel metadata attached.

Returns a list that matches the response from the REST API.

Note

The outer keys from the API response are removed and only the list of records is returned.

send(data: Union[List[dict], dict]) → Tuple[list, list]

Write one or more records synchronously. This is similar to making a single API call to the records endpoint. You will also receive the success and failure arrays, which contain the Gretel IDs that were generated for each ingested record.

Note

Because this is just like making a single call to the Records endpoint, the maximum record count per call will be enforced.

Parameters

data – a dict or a list of dicts

Returns

A tuple of (success, failure) lists
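
One way to act on the (success, failure) tuple is to fail loudly when any record is rejected. The sketch below is a minimal illustration: send_checked is a hypothetical helper, and FakeProject is a hypothetical stand-in whose IDs and rejection rule are invented for the demo. Only the (success, failure) return shape is taken from the description above.

```python
from typing import List, Tuple, Union


class FakeProject:
    """Hypothetical stand-in that mimics the (success, failure) return shape."""

    def send(self, data: Union[List[dict], dict]) -> Tuple[list, list]:
        records = [data] if isinstance(data, dict) else data
        success, failure = [], []
        for index, record in enumerate(records):
            # Assumption for the demo only: empty records are rejected.
            (success if record else failure).append(f"fake-id-{index}")
        return success, failure


def send_checked(project, data: Union[List[dict], dict]) -> list:
    """Send records and raise if any of them were reported as failed."""
    success, failure = project.send(data)
    if failure:
        raise RuntimeError(f"{len(failure)} record(s) failed: {failure}")
    return success


ids = send_checked(FakeProject(), [{"name": "a"}, {"name": "b"}])
```

Depending on the application, logging or retrying the failed records may be preferable to raising.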

send_bulk(data: Union[List[dict], dict]) → WriteSummary

Send a dict or list of dicts to the project. Records are queued and sent in parallel for performance. API responses are not returned.

Note

Since a queue and threading are used here, you can send any number of records in the data param. The records will automatically be chunked into appropriately sized buffers to send to the Records API.

Parameters

data – A dict or a list of dicts.

Returns

A WriteSummary instance.
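
The chunking step mentioned in the note can be sketched as follows. This shows only how a payload might be split into fixed-size buffers; it does not reproduce the client's queue or threads. chunk_records and chunk_size are hypothetical names, and the buffer size the client actually uses is not stated here.

```python
from typing import Iterator, List, Union


def chunk_records(data: Union[List[dict], dict], chunk_size: int) -> Iterator[List[dict]]:
    """Yield successive lists of at most chunk_size records."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # Accept a single dict as well as a list of dicts, like the data param.
    records = [data] if isinstance(data, dict) else list(data)
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]


# Seven records split into buffers of at most three records each.
batches = list(chunk_records([{"n": i} for i in range(7)], chunk_size=3))
batch_sizes = [len(batch) for batch in batches]
```

Each buffer could then be handed to a worker that performs one Records API call, which is why the caller does not need to respect the per-call limit.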

send_dataframe(df: pandas.core.frame.DataFrame, sample=None) → WriteSummary

Send the contents of a DataFrame

This will convert each row of the DataFrame into a dictionary and send it as a record. This operation happens using the bulk writer so no results from the API calls are returned.

Parameters
  • df – A pandas DataFrame

  • sample – Specify a subset of the DataFrame rows to be sent. If sample is > 1, then sample number of rows will be queued for sending. If sample is between 0 and 1, then a fraction of the DataFrame’s rows will be queued for sending. So a sample of .5 will queue up half of the DataFrame’s rows.

Note

Sampling is randomized, not done by first N.

Returns

An instance of WriteSummary

Raises
  • RuntimeError if Pandas is not installed

  • ValueError if a Pandas DataFrame was not provided
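
The two interpretations of the sample parameter can be illustrated without pandas. The sketch below is an assumption-laden approximation, not the library's code: select_rows and seed are hypothetical, a value of exactly 1 is treated as a row count, and fractional counts are rounded down.

```python
import random
from typing import List, Optional, Union


def select_rows(rows: List[dict], sample: Optional[Union[int, float]] = None,
                seed: Optional[int] = None) -> List[dict]:
    """Return all rows, a random count of rows, or a random fraction of rows."""
    if sample is None:
        return list(rows)
    if sample <= 0:
        raise ValueError("sample must be greater than 0")
    if sample < 1:
        # Fraction of the rows; rounding down is an assumption.
        count = int(len(rows) * sample)
    else:
        # Absolute number of rows, capped at what is available.
        # Assumption: a value of exactly 1 is treated as a row count.
        count = min(int(sample), len(rows))
    # Random selection without replacement, not the first N rows.
    return random.Random(seed).sample(rows, count)


rows = [{"n": i} for i in range(10)]
half = select_rows(rows, sample=0.5, seed=0)
three = select_rows(rows, sample=3, seed=0)
```

In pandas, the same two cases correspond to DataFrame.sample(frac=...) and DataFrame.sample(n=...).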