When you work with text data, you often want to split it into training and test sets. This is routine in machine learning and in deep learning's natural language processing. The problem is that you have to split the data randomly between the training and test sets, while at the same time keeping each sample intact.
When working with Named Entity Recognition (NER) you typically deal with the IOB format. It is a very common way of tagging words in text data, but it comes with a problem… the data is stored in a very rudimentary, old-fashioned way:
IOB – Inside-Outside-Beginning
- B-tag -> token is the beginning of a chunk
- I-tag -> token is inside a chunk
- O-tag -> token belongs to no chunk
So you basically have a single line per word, and each sample (a phrase in our case) is delimited by an empty line ("\n").
Last B-DATE
year I-DATE
, O
8.3 B-QUANTITY
million I-QUANTITY
passengers O
flew O
the O
airline O
, O
down O
4 B-PERCENT
percent I-PERCENT
from O
2007 B-DATE
. O

Everyone O
knows O
about O
ADT, B-ORG
longtime O
home O
security O
system O
provider O
in O
the B-GPE
U.S. I-GPE
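
To see how this structure is parsed, here is a minimal sketch (my own illustration, not part of the script below; the file name sample.iob is a placeholder) that groups the lines of an IOB file back into phrases by cutting at every empty line:

# Minimal sketch: group the lines of an IOB file into phrases.
# "sample.iob" is a placeholder file name.
phrases = []
phrase = []
with open("sample.iob", "r") as f:
    for line in f:
        if line.strip() == "":          # an empty line closes the current phrase
            if phrase:
                phrases.append(phrase)
            phrase = []
        else:
            token, tag = line.rsplit(" ", 1)   # "Last B-DATE" -> ("Last", "B-DATE")
            phrase.append((token, tag.strip()))
if phrase:                               # keep a trailing phrase with no final empty line
    phrases.append(phrase)

print(f"{len(phrases)} phrases; first token of first phrase: {phrases[0][0]}")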
Here you will find a simple script that splits such a dataset into training and test sets:
""" | |
Contact me: | |
e-mail: enrique@enriquecatala.com | |
Linkedin: https://www.linkedin.com/in/enriquecatala/ | |
Web: https://enriquecatala.com | |
Twitter: https://twitter.com/enriquecatala | |
Support: https://github.com/sponsors/enriquecatala | |
Youtube: https://www.youtube.com/enriquecatala | |
Description | |
---- | |
This script splits the training data into train and test sets.This will split randomly into train and test the IOB file | |
Helper script to shuffle and split between train and test a file containing [IOB format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). | |
This is usually used when working with NLP tasks like Named Entity Recognition. | |
NOTE: Please, be aware that i´m using the split from memory, so if you have a big file, it will be slow or crash if file does´nt fit in memory | |
Input | |
---- | |
input_file | |
File with the IOB tagged phrases | |
percentage_train | |
Percentage of the data to be used for training | |
output_folder | |
Folder where the output files will be saved | |
Output | |
---- | |
Two files inside the output folder: | |
train.iob | |
test.iob | |
Phrases will be randomly splitted into train and test sets | |
""" | |
import sys | |
from transformers import AutoTokenizer | |
import argparse | |
import random | |
if __name__=='__main__': | |
argparser = argparse.ArgumentParser() | |
argparser.add_argument('input_file', type=str, help='Input file') | |
argparser.add_argument('percentage_train', type=float, help='Percentage of train', default=0.8) | |
argparser.add_argument('output_folder', type=str, help='Output folder', default='./') | |
# check if debug mode | |
gettrace = getattr(sys, 'gettrace', None) | |
if gettrace(): | |
#Simulate the args to be expected | |
argv = ["/home/enrique/git/roberta-finance-NER/data/ECB/training/checkpointLabelstudio.iob","0.8","/home/enrique/git/roberta-finance-NER/data/ECB/training/"] | |
# Parse arguments | |
args = argparser.parse_args(argv) | |
else: | |
args = argparser.parse_args() | |
input_file = args.input_file | |
percentage_train = args.percentage_train | |
output_folder = args.output_folder | |
phrases = [] | |
# read the input_file and count empty lines | |
with open(input_file, 'r') as f: | |
lines = f.readlines() | |
# number of phrases = number of \n lines | |
num_phrases = 0 | |
phrase = [] | |
for line in lines: | |
phrase.append(line) | |
if line == '\n': | |
num_phrases += 1 | |
phrases.append(phrase) | |
phrase = [] | |
# number of phrases detected | |
print('Number of phrases detected: {}'.format(num_phrases)) | |
# Number of train_phrases | |
train_phrases = int(num_phrases * percentage_train) | |
print("Train phrases: {}".format(train_phrases)) | |
# Number of test_lines | |
test_phrases = num_phrases - train_phrases | |
print("Test phrases: {}".format(test_phrases)) | |
# Shuffling phrases | |
print("Shuffling phrases randomly...") | |
random.shuffle(phrases) | |
print(f"Writting data to folder: {output_folder}") | |
# read the input_file and write each line to a new file | |
train_phrases_written = 0 | |
with open(f'{output_folder}/train.iob', 'w') as f_train: | |
with open(f'{output_folder}/test.iob', 'w') as f_test: | |
for phrase in phrases: | |
if train_phrases_written < train_phrases: | |
f_train.write(''.join(phrase)) | |
train_phrases_written += 1 | |
else: | |
f_test.write(''.join(phrase)) | |
print("Done!") |
Enjoy!