Build a profanity filter API with GraphQL
August 5, 2021 · 4 min read
Editor’s note: Examples of profanity in this article are represented by the word “profanity” in order to remain inclusive and appropriate for all audiences.
Detecting and filtering profanity is a task you are bound to run into while building applications where users post (or interact with) text. These can be social media apps, comment sections, or game chat rooms, just to name a few.
Being able to detect profanity so that you can filter it out is key to keeping communication spaces safe and, if your app requires it, age-appropriate.
This tutorial will guide you on building a GraphQL API to detect and filter profanity with Python and Flask. If you are just interested in the code alone, you can visit this GitHub repo for the demo application source code.
Prerequisites
To follow and understand this tutorial, you will need the following:
What is profanity?
Profanity (also known as curse words or swear words) refers to the offensive, impolite, or rude use of words and language. It is also used to express strong feelings about something. Profanity can make online spaces feel hostile to users, which is undesirable for an app designed for a wide audience.
Which words qualify as profanity is up to your discretion. This tutorial will explain how to filter words individually, so you have control over what type of language is allowed on your app.
What is a profanity filter?
A profanity filter is a software or application that helps detect, filter, or modify words considered profane in communication spaces.
Why do we detect and filter profanity?
Common problems faced when detecting profanity
Detecting profanity with Python
Using Python, let’s build an application that tells us whether a given string is profane or not, then proceed to filter it.
Creating a word-list-based profanity detector
To create our profanity filter, we will create a list of unaccepted words, then check if a given string contains any of them. If profanity is detected, we will replace the profane word with a censoring text.
Create a file named filter.py and save the following code in it:
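A minimal sketch of what such a word-list-based filter might look like (the word list, censor text, and the filter_profanity function name are placeholders for illustration):

```python
# filter.py -- a minimal, word-list-based sketch (placeholder word list)

# Words considered unacceptable; extend this list at your discretion
PROFANE_WORDS = ["profanity"]

# Text used to replace any profane word that is found
CENSOR_TEXT = "****"


def filter_profanity(sentence):
    """Return the sentence with every word found in PROFANE_WORDS censored."""
    filtered_words = []
    for word in sentence.split():
        # Compare case-insensitively so "Profanity" is also caught
        if word.lower() in PROFANE_WORDS:
            filtered_words.append(CENSOR_TEXT)
        else:
            filtered_words.append(word)
    return " ".join(filtered_words)
```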
Testing our word-list-based filter
Now let's pass a few sample strings to the function to see how it behaves.
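With the placeholder word list from the sketch above, the results would look something like this:

```python
from filter import filter_profanity

print(filter_profanity("That movie was so much profanity"))
# That movie was so much ****

print(filter_profanity("This is a perfectly clean sentence"))
# This is a perfectly clean sentence
```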
However, this approach has many problems, ranging from being unable to detect profanity outside its word list to being easily fooled by misspellings or word padding. It also requires us to maintain the word list ourselves on an ongoing basis, which only adds to the problems we already have. How do we improve on this?
Using the better-profanity Python library to improve our filter
Better-profanity is a blazingly fast Python library for checking for (and cleaning up) profanity in strings. It supports custom word lists, safelists, detection of profanity in modified word spellings (such as leetspeak) and Unicode characters, and even multilingual profanity detection.
Installing the better-profanity library
In the terminal, type:
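```bash
pip install better-profanity
```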
Integrating better-profanity into our filter
Now, update the filter.py file with the following code:
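A sketch of the updated filter.py, keeping the filter_profanity name from earlier and delegating the actual work to better-profanity:

```python
# filter.py -- now backed by the better-profanity library
from better_profanity import profanity

# Load the library's default list of censored words
profanity.load_censor_words()


def filter_profanity(sentence):
    """Return the sentence with profane words censored by better-profanity."""
    return profanity.censor(sentence)
```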
Testing the better-profanity-based filter
If you pass the same sample strings to the updated function, better-profanity censors them as expected.
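Continuing with the article's convention of using "profanity" as a stand-in for a real word on the library's list:

```python
from filter import filter_profanity

# "profanity" stands in for a real profane word on better-profanity's word list
print(filter_profanity("That movie was so much profanity"))
# That movie was so much ****

print(filter_profanity("This is a perfectly clean sentence"))
# This is a perfectly clean sentence
```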
As I mentioned previously, better-profanity supports profanity detection in modified word spellings, so examples like the following will be censored accurately:
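(Here, "pr0fanity" and "pr@fanity" are stand-ins for leetspeak variants of a real word on the list; better-profanity recognizes substitutions such as 0 for o and @ for a.)

```python
from filter import filter_profanity

print(filter_profanity("That movie was so much pr0fanity"))
# That movie was so much ****

print(filter_profanity("That movie was so much pr@fanity"))
# That movie was so much ****
```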
Better-profanity also has functionalities to tell if a string is profane. To do this, use:
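```python
from better_profanity import profanity

# Returns True if the string contains a word on the profane word list
profanity.contains_profanity("That movie was so much profanity")    # True (for a real profane word)
profanity.contains_profanity("This is a perfectly clean sentence")  # False
```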
Better-profanity also allows us to provide a character to censor profanity with. To do this, use:
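```python
from better_profanity import profanity

# The second argument to censor() is the character used for censoring
profanity.censor("That movie was so much profanity", '-')
# 'That movie was so much ----'  (for a real profane word)
```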
Building a GraphQL API for our filter
We have created a Python script to detect and filter profanity, but it’s pretty useless in the real world as no other platform can use our service. We’ll need to build a GraphQL API with Flask for our profanity filter, so we can call it an actual application and use it somewhere other than a Python environment.
Installing the application requirements
In the terminal, type:
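One plausible set of requirements for a Flask and graphene setup (the package choices beyond better-profanity are assumptions, not the article's exact list):

```bash
pip install flask graphene flask-graphql better-profanity
```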
Writing the application’s GraphQL schemas
Next, let’s write our GraphQL schemas for the API. Create a file named schema.py and save the following code in it:
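A minimal sketch of what schema.py could look like, assuming graphene and a single filteredText query field (the field name and resolver are illustrative):

```python
# schema.py -- a possible GraphQL schema for the profanity filter
import graphene
from better_profanity import profanity


class Query(graphene.ObjectType):
    # filtered_text is exposed as `filteredText` in the GraphQL API
    filtered_text = graphene.String(text=graphene.String(required=True))

    def resolve_filtered_text(self, info, text):
        # Censor any profane words in the supplied text
        return profanity.censor(text)


schema = graphene.Schema(query=Query)
```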
Configuring our application server for GraphQL
After that, create another file named server.py and save the following code in it:
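A matching server.py sketch, assuming flask-graphql's GraphQLView is used to serve the schema with the GraphiQL UI at the root URL:

```python
# server.py -- Flask server exposing the GraphQL schema
from flask import Flask
from flask_graphql import GraphQLView

from schema import schema

app = Flask(__name__)

# Serve the GraphQL endpoint (and the GraphiQL UI) at the root URL
app.add_url_rule(
    "/",
    view_func=GraphQLView.as_view("graphql", schema=schema, graphiql=True),
)

if __name__ == "__main__":
    app.run(debug=True)
```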
Running the GraphQL server
To run the server, execute the server.py script.
In the terminal, type:
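```bash
python server.py
```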
Your terminal should look like the following:
Testing the GraphQL API
After running the server.py file in the terminal, head to your browser and open the URL http://127.0.0.1:5000. You should have access to the GraphiQL interface and get a response similar to the image below:
We can proceed to test the API by running a query like the one below in the GraphiQL interface:
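Assuming the filteredText field from the schema sketch above, a query would look roughly like this:

```graphql
{
  filteredText(text: "That movie was so much profanity")
}
```

The response comes back as JSON, with the censored string under data.filteredText.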
The result should be a JSON response containing the censored version of the text you supplied.
Conclusion
This article taught us about profanity detection, its importance, and its implementation. In addition, we saw how easy it is to build a profanity detection API with Python, Flask, and GraphQL.
The source code of the GraphQL API is available on GitHub. You can learn more about the better-profanity Python library from its official documentation.
profanity-check
A fast, robust Python library to check for profanity or offensive language in strings. Read more about how and why profanity-check was built in this blog post. You can also test out profanity-check in your browser.
profanity-check uses a linear SVM model trained on 200k human-labeled samples of clean and profane text strings. Its model is simple but surprisingly effective, meaning profanity-check is both robust and extremely performant.
Why Use profanity-check?
No Explicit Blacklist
Many profanity detection libraries use a hard-coded list of bad words to detect and filter profanity. For example, profanity uses this wordlist, and even better-profanity still uses a wordlist. There are obviously glaring issues with this approach, and, while they might be performant, these libraries are not accurate at all.
Other libraries like profanity-filter use more sophisticated methods that are much more accurate but at the cost of performance. A benchmark (performed December 2018 on a new 2018 Macbook Pro) using a Kaggle dataset of Wikipedia comments yielded roughly the following results:
| Package | 1 Prediction (ms) | 10 Predictions (ms) | 100 Predictions (ms) |
| --- | --- | --- | --- |
| profanity-check | 0.2 | 0.5 | 3.5 |
| profanity-filter | 60 | 1200 | 13000 |
| profanity | 0.3 | 1.2 | 24 |
This table speaks for itself:
| Package | Test Accuracy | Balanced Test Accuracy | Precision | Recall | F1 Score |
| --- | --- | --- | --- | --- | --- |
| profanity-check | 95.0% | 93.0% | 86.1% | 89.6% | 0.88 |
| profanity-filter | 91.8% | 83.6% | 85.4% | 70.2% | 0.77 |
| profanity | 85.6% | 65.1% | 91.7% | 30.8% | 0.46 |
See the How section below for more details on the dataset used for these results.
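A short sketch of the library's two entry points (the example string and output values are illustrative):

```python
from profanity_check import predict, predict_prob

# predict() labels each string: 1 = profane, 0 = clean
predict(["have a nice day"])       # e.g. array([0])

# predict_prob() gives the probability that each string is profane
predict_prob(["have a nice day"])  # e.g. a small value such as array([0.03])
```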
Note that both predict() and predict_prob() return NumPy arrays.
More on How/Why It Works
Special thanks to the authors of the datasets used in this project. profanity-check was trained on a combined dataset from 2 sources:
One simplified way you could think about why profanity-check works is this: during the training process, the model learns which words are "bad" and how "bad" they are because those words will appear more often in offensive texts. Thus, it's as if the training process is picking out the "bad" words out of all possible words and using those to make future predictions. This is better than just relying on arbitrary word blacklists chosen by humans!
This library is far from perfect. For example, it has a hard time picking up on less common variants of swear words like "f4ck you" or "you b1tch" because they don't appear often enough in the training corpus. Never treat any prediction from this library as unquestionable truth, because it does and will make mistakes. Instead, use this library as a heuristic.
rominf / profanity-filter
A Python library for detecting and filtering profanity
License: GNU General Public License v3.0
profanity-filter’s Introduction
profanity-filter: A Python library for detecting and filtering profanity
This library is no longer a priority for me. Feel free to fork it.
profanity-filter is a universal library for detecting and filtering profanity. Support for English and Russian is included.
Here are some basic examples of how to use the library. For more examples, please see the tests folder.
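A brief sketch of basic usage (the input string uses "profanity" as a stand-in for a real profane word):

```python
from profanity_filter import ProfanityFilter

pf = ProfanityFilter()

# Censor profane words in a string
pf.censor("That is profanity!")      # e.g. "That is *********!" for a word on the list

# Check whole strings
pf.is_clean("That is profanity!")    # False for a genuinely profane word
pf.is_profane("That is profanity!")  # True for a genuinely profane word
```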
Using as a part of Spacy pipeline
RESTful web service
The first two parts of the installation instructions are designed for users who want to filter English profanity. If you want to filter profanity in another language, you still need to read them.
For a minimal setup, you need to install profanity-filter, which is bundled with spacy, and download a spacy model for tokenization and lemmatization:
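A rough sketch of those two steps (the en shortcut is the model name used by older spacy releases; adjust for your spacy version):

```bash
pip install profanity-filter
python -m spacy download en
```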
For more info about Spacy models read: https://spacy.io/usage/models/.
To get the deep analysis functionality, install additional libraries and a dictionary for your language.
Firstly, install hunspell and hunspell-devel packages with your system package manager.
For Amazon Linux AMI run:
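Something along these lines (exact package names may vary by distribution):

```bash
sudo yum install hunspell hunspell-devel
```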
Other language support
Let’s take Russian for example on how to add new language support.
Russian language support
First, we need to provide the file profanity_filter/data/ru_badwords.txt, which contains a newline-separated list of profane words. For Russian it's already present, so we skip file generation.
Next, we need to download the appropriate Spacy model. Unfortunately, a Spacy model for Russian is not yet ready, so we will use an English model for tokenization. If you haven't installed the Spacy model for English yet, now is the right time to do so. As a consequence, even if you only want to filter Russian profanity, you need to specify English in the ProfanityFilter constructor, as shown in the usage examples.
Next, we download dictionaries in Hunspell format for deep analysis from the site https://cgit.freedesktop.org/libreoffice/dictionaries/plain/:
You need to install the polyglot package and its requirements for language detection. See https://polyglot.readthedocs.io/en/latest/Installation.html for more detailed instructions.
For Amazon Linux AMI run:
RESTful web service
If something is not right, you can import dependencies yourself to see the import exceptions:
English profane word dictionary: https://github.com/areebbeigh/profanityfilter/ (author Areeb Beigh).
Russian profane word dictionary: https://github.com/PixxxeL/djantimat (author Ivan Sergeev).
profanity-filter’s Issues
Minimize profane word dictionaries for deep analysis usage
TypeError when calling extra_profane_word_dictionaries
When supplying a dict (
Traceback (most recent call last):
  File " ", line 4, in
  File "/home/hwhite/frac37/lib/python3.7/site-packages/profanity_filter/profanity_filter.py", line 265, in custom_profane_word_dictionaries
    self.clear_cache()
  File "/home/hwhite/frac37/lib/python3.7/site-packages/profanity_filter/profanity_filter.py", line 384, in clear_cache
    self._update_profane_word_dictionary_files()
  File "/home/hwhite/frac37/lib/python3.7/site-packages/profanity_filter/profanity_filter.py", line 429, in _update_profane_word_dictionary_files
    profane_word_file = self._DATA_DIR / f'
Fails to detect phrases
What am I doing wrong?
I created a simple service using instructions from your readme but nothing works.
And I got errors when I call censor method.
P.S. I tried different ways but have no luck.
Try TinyFastSS
Optionally store cache in MongoDB
This will make parallelized censoring faster. This should be optional because the user will need to set up MongoDB and install additional dependencies.
Not working with auto-py-to-exe
Use more-itertools library
Can’t import profanity_filter
I’m getting this error when trying to import the library on a python terminal. Using both python 3.6.0 and 3.7.0
Some plurals not considered profane
Speedup initialization
The bottlenecks are:
Make tests faster
For every test, a new instance of the profanity filter is created. I think it should be possible to cache fixtures.
Unable to mark words as not profane (Customization / English)
Hey,
I have been using the Library to classify english texts.
The one problem I have been facing is that the tool is wrongly classifying words that have devil, hell or allah in it. I was wondering if I can remove those from the Library’s Dictionary.
Thanks,
Vyom
Show real profane word to a user
Hi Roman,
Thank you for sharing a code for your product. I learned a lot from it and
find it very powerful and reliable for the amount of features it provides. Did not try all of them yet though. 🙂
Have a suggestion.
Can we bring up the bad_word that was mutated by the user into result?
Ex, if I have «shiiiit» as an input, I would want to know what was the real bad_word that Levenshtein «had in mind» («shit»). This example is easy but sometimes there are cases when you cannot even guess why the word is censored.
Do you see a value in it? Do you think it makes sense to add it? Maybe by extra parameter if not always?
Thank you very much for being very responsive and providing an excellent support for your great product!
Windows 10, Python 3.8 can’t run console command
After installing, I’m not getting the console command to work.
It’s not in my C:\Python38\Scripts nor my C:\Users\abc\AppData\Roaming\Python\Python38\Scripts
Only first language in a list of languages is working
Invalid syntax in profanity_filter.py Class config
This error pops
SyntaxError: invalid syntax in File "/usr/local/lib/python3.5/dist-packages/profanity_filter/profanity_filter.py", line 102
censor_char: str = '*'
Is this a python3 issue? Does this only support python 2?
TypeError: __init__() got an unexpected keyword argument 'lang'
Bug in saving profane word in redis
Expected behavior
Profane word is saved in redis.
Real behavior
Exception is thrown.
How to reproduce
The _save_censored_word method will throw an exception.
Failed to detect number substitutions
When trying to identify profane words, sh1t is not getting identified as profane.
The Levenshtein approach should have identified the variation of the original profane word.
Also, I see that sh1t is listed under the profane word dictionary. Could you please see where the problem is?
Publish on Spacy website
Improve README.md
where to cd?
the Deep learning section contains code to cd into profanity_filter/data, where are these
Make REST webservice for profanity filtering
Also package it to the Docker.
Parallelize censoring
I think dask is a good solution because it has a nice API and can be used in a cluster.
The easiest and most effective parallelization is to map words after tokenization.
Make it possible to change DATA_DIR
It should be implemented as a settable property. Note that the cache should be cleared after setting the new value.
Get exception on particular input
For these inputs "deathfrom", "eskimobob", "piazza@gma" with pf.censor_whole_words=False, pf.censor_word throws the exception below.
Publish all dependencies on PyPI to avoid installation via git URLs
Refactor tests
Use the Spacy component for most tests, as it offers more information.
censor() and censor_word() give different results for profanity
How to explain this behavior in a current version?
Do not try to search profanity in compound words of dictionary words in emails and URLs
For example, these words should not be detected as profane: «deathfrom», «eskimobob» if they come as part of emails and URLs.
Profanity Filter for Laravel
Profanity Filter takes strings as input and removes any curse words the string might contain. It checks strings against a specific blacklist; a word must match as a separate word to be considered a curse word. If a curse word is found, it is replaced with a censor character of the user's choosing (the default is *).
This package is intended to be used with Laravel. Tested and working with Laravel 5.4.
This code is based on Fastwebmedia/Profanity-Filter. A major part of it is taken from there, and I added the things I thought it required.
Laravel
Add 'Sworup\ProfanityFilter\ProfanityServiceProvider' to your providers array.
If you wish to use the Facade then add ‘Profanity’ => ‘Sworup\ProfanityFilter\Profanity’
The package will automatically use the config file containing the list of banned words.
The above code would return:
Please see CHANGELOG for more information about what has changed recently.
Please see CONTRIBUTING and CONDUCT for details.
If you discover any security related issues, please email sworup.shakya@gmail.com instead of using the issue tracker.
The MIT License (MIT). Please see License File for more information.
About
Profanity filter package would help you censor some of the bad words users put in your posts and/or comments.
The Profanity Filter for Rails
This plugin will allow you to filter profanity using basic replacement or a dictionary term.
You can use it in your models:
Notice – there are two profanity filters, one is destructive. Beware the exclamation point (profanity_filter!).
Non-Destructive (filters content when called, original text remains in the database)
Destructive (saves the filtered content to the database)
You can also use the filter directly:
Inquiring minds can check out the simple benchmarks I've included so you can have an idea of what kind of performance to expect. I've included some quick scenarios covering strings of (100, 1000, 5000, 1000) words and dictionaries of (100, 1000, 5000, 25000, 50000, 100000) words.
You can run the benchmarks via:
May break ProfanityFilter out on its own
Clean up dictionary implementation and substitution (suboptimal and messy)
Move benchmarks into a rake task
Ability to supplement the profanity database (with a yaml outside of the gem) via @seankibler
Easy custom blacklists/dictionaries (essentially the same as above)
The Profanity Filter for Rails uses the MIT License. Please see the MIT-LICENSE file.
Created by Adam Bair (adam@intridea.com) of Intridea (www.intridea.com) in the open source room at RailsConf 2008. Originally called Fu-fu: The Profanity Filter for Rails.
About
A Rails plugin gem for filtering out profanity.