What is the goal of this data collection campaign?
The goal of this campaign is to collect as much high-quality Papiamentu data as possible. This data will be used to train a Papiamentu foundation model. A foundation model is a core AI system that understands and generates Papiamentu. You can think of it as something similar to ChatGPT, but built specifically for and in Papiamentu.
Why are you building a foundation model?
A foundation model is the core technology behind AI tools like ChatGPT and Gemini. Most existing foundation models are trained mainly on English and other widely used languages, because a lot of data in those languages is available online. For languages like Papiamentu, which are less represented on the internet, these models perform poorly. They often produce low-quality language output and fail to understand the cultural context, values, and real-life needs of Papiamentu speakers. Another important reason is control. Most large language models are owned and managed by big tech companies. When they change their models or policies, users have no say. This project offers an alternative: an AI model built for the Papiamentu-speaking community and governed by that community.
Why is this project open source, and what does that mean in practice?
This project is a community effort and depends on contributions from many people and organizations. Making the project open source means that the work we do is shared openly and transparently, so others can learn from it, build on it, and benefit from it. In practice, this means that SPLIKA gives back to the community and ensures that the project remains accessible, accountable, and collaborative.
Who is behind this initiative and how is it governed?
The initiative is led by the SPLIKA Digital & AI committee and is embedded within the non-profit organization SPLIKA. SPLIKA oversees the project, sets its direction, and ensures that all activities align with ethical standards and applicable EU data and AI regulations. The mission of the SPLIKA Digital & AI committee is to develop AI technologies that support communication in Papiamento and Papiamentu in a fair, ethical, and responsible way.
Who will benefit from this model?
Anyone interested in using AI in Papiamentu can benefit. In particular, this includes: business owners, government institutions, educators and students, and content creators. The model can support new businesses, improve access to educational content, speed up administrative tasks, and help expand digital tools in Papiamentu.
What types of data are you looking for?
We are primarily looking for text-based data, such as: newspaper articles, books, poetry, dictionaries, and educational materials. However, we are also interested in other types of Papiamentu-related data, including images, videos, audio recordings, radio broadcasts, and more. If you are unsure whether your data is relevant, feel free to reach out.
What types of data should not be submitted?
You should not submit: illegal data, personally identifiable or sensitive personal data, data you do not want SPLIKA to have, or data you do not own or are not allowed to share. AI models can sometimes reproduce parts of their training data, so anything included should be suitable for public use.
How much data can I donate?
There is no minimum or maximum amount. You decide how much data you would like to donate.
Is my data required to be in a specific format?
The preferred format is .txt, as it is the easiest for us to process. However, we also accept PDFs, Word documents, images (JPG, PNG), and other formats. Along with the data, we ask for additional information such as: who owns the copyright, when the data was created, and what type of data it is. This information helps us understand the dataset and is essential for creating proper documentation (data cards).
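To illustrate the kind of additional information mentioned above, here is a minimal sketch of a metadata record for one donated file. The class name, field names, and example values are hypothetical and chosen only for illustration; the portal may collect this information in a different form.

```python
from dataclasses import dataclass, asdict
from datetime import date
import json


@dataclass
class DonationMetadata:
    """Hypothetical metadata record for one donated file (not the portal's real schema)."""
    file_name: str
    copyright_owner: str   # who owns the copyright
    created_on: date       # when the data was created
    data_type: str         # e.g. "newspaper article", "poetry", "dictionary"

    def to_json(self) -> str:
        record = asdict(self)
        # Dates are not JSON-serializable by default, so store them as ISO strings.
        record["created_on"] = self.created_on.isoformat()
        return json.dumps(record, ensure_ascii=False)


# Example values are invented for illustration.
example = DonationMetadata(
    file_name="poems_1998.txt",
    copyright_owner="Example Author",
    created_on=date(1998, 5, 1),
    data_type="poetry",
)
print(example.to_json())
```

A record like this could later feed into a data card, since it captures ownership, date, and type in one place.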
Can individuals and organizations both donate data?
Yes. Anyone, including individuals, organizations, institutions, and companies, can donate data.
How will my data be used to train the model?
First, the data is prepared for use. This includes cleaning it and converting it into plain text. If the data comes from formats like PDFs, the text is extracted first. The cleaned text is then used to train a computer system to understand how Papiamentu works. The system learns by reading many examples and practicing how to continue sentences. Over time, it learns grammar, word meanings, and sentence structure. Once trained, the model can be used in a chat-style interface, allowing people to interact with it conversationally.
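The idea of cleaning text and then "practicing how to continue sentences" can be shown with a deliberately tiny sketch. It is not the actual training pipeline: a real foundation model uses a neural network trained on far more data, while this toy version only counts which word tends to follow which. The function names and the sample sentences are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict


def clean_text(raw: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", raw).strip()


def train_next_word_counts(sentences):
    """Count, for every word, which words follow it (a toy stand-in for training)."""
    following = defaultdict(Counter)
    for sentence in sentences:
        # Lowercase and keep only word characters, dropping punctuation.
        words = re.findall(r"\w+", clean_text(sentence).lower())
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following


def predict_next(following, word):
    """Return the most frequently seen next word, or None if the word was never seen."""
    candidates = following.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Illustrative sample sentences, not real donated data.
corpus = ["Bon dia,   kon ta bai?", "Kon ta bai ku bo?"]
model = train_next_word_counts(corpus)
print(predict_next(model, "kon"))
```

The more examples such a system sees, the better its guesses become, which is why the amount and quality of donated text matters.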
Will my data be combined with other datasets?
The data is stored separately. However, during training, the model learns from many datasets together.
Will the data be used for purposes beyond model training?
SPLIKA Digital & AI uses the data primarily to train open-source AI models and to conduct research on responsible AI behavior, including safety and bias. Summary documentation is published for transparency. The data is not shared with third parties without the donor’s prior written consent, unless required by law.
Could my data be used in commercial applications?
Not directly and not intentionally. However, AI models developed by SPLIKA Digital & AI may be used in public applications, including commercial ones.
Will the trained model reproduce or reveal my original data?
AI models are not deterministic and may sometimes reproduce patterns or fragments from training data. This is why donors should avoid submitting sensitive or private information.
How do you protect sensitive or personal information?
SPLIKA Digital & AI takes strong measures to protect data. Donor information is stored separately from the data itself and is secured using encryption. Internally, anonymous identifiers are used instead of personal names. Only authorized members of the team can access the data.
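As a rough sketch of what storing donor information separately under anonymous identifiers can look like, the example below keeps two separate stores linked only by a random ID. Everything here is an assumption for illustration: the store names and the function are hypothetical, plain dictionaries stand in for real storage, and encryption and access control are not shown.

```python
import uuid

# Hypothetical stores. In a real system these would be separate, encrypted,
# access-restricted storage locations rather than in-memory dictionaries.
donor_registry = {}  # anonymous ID -> donor details
data_store = {}      # anonymous ID -> donated file names (no personal names)


def register_donation(donor_name: str, donor_email: str, file_name: str) -> str:
    """Store donor details and donated data separately, linked only by a random ID."""
    donor_id = uuid.uuid4().hex  # random identifier, not derived from donor details
    donor_registry[donor_id] = {"name": donor_name, "email": donor_email}
    data_store[donor_id] = [file_name]
    return donor_id


# Example values are invented for illustration.
donor_id = register_donation("Example Donor", "donor@example.org", "articles_2020.txt")
print(data_store[donor_id])
```

Someone who only has access to the data store sees the identifier and the files, but not who donated them.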
Is my data anonymized or de-identified?
Yes. By separating donor information from the data, we ensure that the data cannot be traced back to the donor. Additionally, we run a pre-processing step that aims to remove as much personally identifiable information as possible. Donors are also asked to estimate how much personal information is present in the data.
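A pre-processing step that removes personal information could, in its simplest form, look like the sketch below. This is an assumption about the general technique, not a description of the actual SPLIKA pipeline. The patterns are deliberately simple: they only target e-mail addresses and phone-like numbers, will miss names and addresses, and may also redact other long digit sequences such as dates.

```python
import re

# Simplified, illustrative patterns. Real PII removal needs much broader coverage.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s-]{6,}\d")


def redact_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    text = PHONE_PATTERN.sub("[PHONE]", text)
    return text


# The contact details below are invented for illustration.
sample = "Mail ana@example.org or call +599 9 123 4567."
print(redact_pii(sample))
```

Because automated removal is never complete, it remains important that donors check their data before submitting it.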
What steps do you take to prevent misuse of the data?
We protect the data by: encrypting it, separating donor information from the data, and limiting access to authorized personnel only. When donating data, donors enter into a legally binding agreement that makes SPLIKA responsible for using the data as described.
How do you comply with data protection laws and regulations?
We comply by removing personally identifiable information, using the data only for its stated purpose, and securing it with encryption and strict access controls.
Who will have access to the raw data?
Only a few key, authorized people within SPLIKA Digital & AI have access to the raw data.
Do I retain ownership of my data after donating it?
Yes. You retain ownership of your data.
What rights am I granting when I donate data?
By donating data, you grant SPLIKA a non-exclusive, worldwide, royalty-free, and perpetual license to use the data for: training AI models, conducting relevant research, and creating documentation.
Can I place restrictions on how my data is used?
SPLIKA Digital & AI will use your data only for training models, conducting research, and creating documentation. Within these activities, restrictions cannot be applied. Any additional use would require your explicit consent.
Can I withdraw my data after donating it?
You can request that parts of the data be deleted, but full removal may not always be possible.
What happens if my data includes information about other people?
If possible, you should remove that information yourself. Otherwise, please indicate its presence so SPLIKA Digital & AI can remove it during processing.
How does the data donation portal work?
The data portal is a secure upload system. After creating an account and receiving approval, you can request a timeframe during which you can upload data. You can also provide information about each file and view all uploaded files.
Is the portal secure?
Yes. Only approved users can access the portal, and all data is transmitted through encrypted connections.
What happens after I upload my data?
You will receive an e-mail confirming that the upload was successful. SPLIKA Digital & AI is notified as well.
Will I receive confirmation or updates?
Yes. You will receive confirmation by e-mail, and donors may receive additional updates related to the use of their data.
Who can I contact if I have technical issues?
You can send an e-mail to data@papiamentu.ai, and SPLIKA Digital & AI will respond as soon as possible.
How can I donate data without using the portal?
Contact SPLIKA Digital & AI via e-mail at data@papiamentu.ai. We will agree on a suitable method, such as in-person transfer, e-mail, or cloud storage access. The data donation agreement must be accepted before proceeding.
Are offline or bulk submissions accepted?
Yes. SPLIKA is flexible in how data is received.
Is there a difference in how portal and non-portal data is handled?
No. Data received through other methods is uploaded to the portal on your behalf.
Will you publish documentation about the dataset and model?
Yes. We will publish both a data card and a model card.
How can I track the progress of the project?
SPLIKA Digital & AI sends a quarterly update e-mail to the mailing list. Donors may also receive additional updates. We also aim to host workshops at least twice a year. Our Data Donation page will be updated periodically.
Will you share reports on how donated data is used?
Yes. Periodic reports will be published.
How can the community provide feedback or raise concerns?
You can contact SPLIKA Digital & AI via e-mail or participate in the online workshops.
What are the risks of donating data?
If you are unsure what is in your data, you may accidentally share sensitive information. You can request the removal of specific data if needed.
What are the limitations of the model you are building?
There are various limitations, but the main one is that the model can only perform tasks for which sufficient data exists. If certain data is missing, the model cannot learn that task.
How do you address bias or gaps in the data?
We actively research biases and gaps and take steps to address them. In some cases, data may be excluded.
What happens if harmful or low-quality data is submitted?
SPLIKA Digital & AI decides which data to include. Harmful data is excluded, and illegal data may be reported to the authorities, as required by the agreement.
How will my donation make a difference?
Papiamentu data is scarce. By donating, you help build a high-quality dataset that powers an open AI model available to everyone for free.
Will contributors be acknowledged?
With consent, contributors’ names will be listed on the Contributors page.
Can I reference my participation publicly?
Yes, and we encourage it. You may link to the Contributors page.
How does this project contribute to the broader AI ecosystem?
It supports a more open, community-driven, and democratic approach to AI development.
Who should consider donating data?
Anyone with Papiamentu data they are allowed to share should consider donating. This includes individuals, publishers, and organizations.
How do I know if my data is valuable?
If the data was created or reviewed by a Papiamentu professional, it is likely valuable. If you are unsure, contact SPLIKA Digital & AI and we can review it with you.
Where can I learn more before donating?
This FAQ is a good starting point. You can also contact us by e-mail.
How can I get involved beyond donating data?
You can explore volunteer opportunities on the Volunteering page.