Nonprofits and Data

This group is for those interested in learning and sharing about all things data-related for nonprofits. The Nonprofits and Data group is for people using data to serve a mission, either directly or by improving nonprofits and the nonprofit sector. That includes everything from collecting data and managing databases to analytics, data visualization and data mining. Here are some examples of topics we discuss: using data to improve organizational effectiveness, measuring impact, using data for storytelling, tools for data management and analysis, figuring out the “right” data to collect, and learning skills to help us use data better.

Tool for large email list (1 million+) deduplication

  • 1.  Tool for large email list (1 million+) deduplication

    Posted Apr 05, 2019 12:12
    Hello All,

    Does anyone have a recommendation for a tool to dedupe a large text file of email addresses? I've found free dedupe services online like List Scrubber but am hesitant to paste our entire email list on a site when I don't know where our data will end up. Any of the paid data cleansing products I've found have way more functionality than I need. I'm looking for a simple, secure and cheap solution for $25 or less.

    Many thanks!

    ------------------------------
    Denise Cummings
    Data Systems Administrator
    Friends of the Earth
    Washington, DC
    ------------------------------
    Tech Accelerate


  • 2.  RE: Tool for large email list (1 million+) deduplication

    Posted Apr 06, 2019 08:30
    Edited by Dave Boyce Apr 06, 2019 08:31
    Hello Denise,

    Microsoft Excel has a Remove Duplicates function:

    1. Paste the emails into Excel, or open the CSV file in Excel.
    2. Select 'Remove Duplicates' from the Data menu, or find it by typing in the Excel search box.
    3. Specify the email column.

    DAVE BOYCE | Data Solutions Consultant
    501 Commons - Seattle, WA

    ------------------------------
    Dave Boyce
    ------------------------------



  • 3.  RE: Tool for large email list (1 million+) deduplication

    Posted Apr 08, 2019 09:59

    Thanks, Dave. Unfortunately, my file is too large for Excel. It's actually over 2 million emails. I think I've found a workaround in Access for now. I can import the list and, in a query, change Unique Values to "Yes" on the Properties table. That should hold us for a while until we outgrow the 2 GB capacity.

     

    Thanks again!

    ------------------------------
    Denise Cummings
    Data Systems Administrator
    Friends of the Earth
    Email: dcummings@foe.org
    Direct: 202-222-0718
    www.foe.org
    ------------------------------



  • 4.  RE: Tool for large email list (1 million+) deduplication

    Posted Apr 09, 2019 09:06
    Maybe use Power BI for the de-duplication. I would presume that if you can do it in Access, the file size should not be an issue, and you wouldn't have to take the step of setting the unique values property to "Yes". That won't save you much time, but it's another option.

    Dan

    --
    Dan Snyder
    Assistant Director of Advancement Services 
    Office of Institutional Advancement 

    Bennington College 

    One College Drive 
    Bennington, VT 05201 
    Office: +1 (802) 447-4238 
    Fax: +1 (802) 440-4351






  • 5.  RE: Tool for large email list (1 million+) deduplication

    Posted Apr 09, 2019 10:28
    Thanks, Daniel. I just downloaded the free version.

    ------------------------------
    Denise Cummings
    Data Systems Administrator
    Friends of the Earth
    Washington, DC
    ------------------------------



  • 6.  RE: Tool for large email list (1 million+) deduplication

    Posted Apr 10, 2019 14:04
    Hello:

    I'd like to revise David B's Excel solution by breaking the list down into two or three smaller lists - small enough that XL can handle them. Let XL sort each sub list alphabetically. Then take the top 1/3 of each sub list (A-J for example) and combine them into one (holding the top third of the alphabet) and de-duplicate that portion. Repeat as needed.
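
    For anyone who would rather script the split-and-combine idea than do it by hand, here is a minimal Python sketch. It assumes the addresses are plain strings and that case and surrounding spaces should be ignored; the function name is made up for illustration. It groups addresses by first character, so every duplicate lands in the same group and each group can be de-duplicated on its own.

```python
from collections import defaultdict


def dedupe_in_buckets(emails):
    """Group addresses by first character, then de-duplicate each group.

    Duplicates always share a first character, so no group needs to see
    the others. Returns one sorted, duplicate-free list.
    """
    buckets = defaultdict(list)
    for raw in emails:
        email = raw.strip().lower()  # assumption: case and spaces don't matter
        if email:
            buckets[email[0]].append(email)

    result = []
    for key in sorted(buckets):
        # set() drops the repeats; sorted() restores alphabetical order
        result.extend(sorted(set(buckets[key])))
    return result
```

    In practice each group could be written to its own smaller file and handled separately, which is the same idea as the alphabetical thirds described above.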

    I do not recommend feeding them all through your Outlook application. Outlook will handle them. Your ISP/email provider, probably, will not.

    Former XL support geek...

    ------------------------------
    DJ Brown
    Holder of multiple IT Hats.
    ISO of SAA, Inc.
    Houston, TX
    ------------------------------



  • 7.  RE: Tool for large email list (1 million+) deduplication

    Posted Apr 09, 2019 09:18
    Hi Denise,
    You might look into using the open source programming language Python for things like this. Python is free. It definitely requires some investment in learning to code, but I have found it well, well worth my time for working with data: all the time I spent learning it has been recouped 100x over. I started with the book "Automate the Boring Stuff with Python" and a couple of introductory books. Once you learn the basics, the possibilities are limitless.
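
    To give a sense of scale, a de-duplication script can be quite short. The following is only a sketch, assuming one address per line in a UTF-8 text file and that addresses differing only by case or surrounding spaces count as duplicates; the function names are illustrative. A couple of million short strings should fit in memory on a typical laptop, though that is worth checking on your own machine.

```python
def unique_emails(lines):
    """Yield each address once, in the order it first appears."""
    seen = set()
    for line in lines:
        email = line.strip().lower()  # assumption: ignore case and spaces
        if email and email not in seen:
            seen.add(email)
            yield email


def dedupe_file(src_path, dst_path):
    """Write the unique addresses from src_path to dst_path; return the count."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for email in unique_emails(src):  # reads the file one line at a time
            dst.write(email + "\n")
            count += 1
    return count
```

    Everything runs locally, so the list never leaves your computer.
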
    Ada

    ------------------------------
    Ada Welch
    Director of Planning and Evaluation
    Center for Urban Community Services
    New York, New York
    ------------------------------



  • 8.  RE: Tool for large email list (1 million+) deduplication

    Posted Apr 09, 2019 10:15
    Thanks, Ada. I'll definitely look into it.

    ------------------------------
    Denise Cummings
    Data Systems Administrator
    Friends of the Earth
    Washington, DC
    ------------------------------



  • 9.  RE: Tool for large email list (1 million+) deduplication

    Posted Apr 09, 2019 09:45
    Maybe I missed something, but what are you using to manage your e-mails?  I'd be surprised if whatever system you are using for e-mail marketing doesn't have a function to handle duplicate e-mail addresses and it's not like you can just e-mail 2 million+ people out of your inbox.  That would also help you prevent duplicates moving forward as you add new people to your list.

    Or are you in between systems and that's why you're using Access?  I think you can still use SQL in Access -- again, you'd have to learn it or someone would have to know it, but you could then run a query to find and remove the duplicates on the data you have rather than having to create an extra field to keep track of that.  If this is a one-time situation, it might be cheaper to hire a freelancer who already knows SQL or Python or whatever to do it for you, because it should be fairly quick for them if it's based on something simple and straightforward like duplicate e-mail addresses and you're just removing whatever is not the first instance (as opposed to some other criteria of which record to keep).
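
    As a rough illustration of the SQL approach, the sketch below uses Python's built-in sqlite3 module as a stand-in for Access, which is an assumption made only so the example runs anywhere. The table name `contacts` and column name `email` are hypothetical, and Access's SQL dialect may differ in details. `SELECT DISTINCT` returns each address once, and `GROUP BY ... HAVING` reports which addresses were duplicated.

```python
import sqlite3


def load(emails):
    """Create an in-memory table of addresses (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE contacts (email TEXT)")
    conn.executemany("INSERT INTO contacts (email) VALUES (?)",
                     [(e,) for e in emails])
    return conn


def distinct_emails(conn):
    """Return each address exactly once, sorted alphabetically."""
    rows = conn.execute(
        "SELECT DISTINCT email FROM contacts ORDER BY email").fetchall()
    return [row[0] for row in rows]


def duplicate_counts(conn):
    """Return (email, count) pairs for addresses that appear more than once."""
    return conn.execute(
        "SELECT email, COUNT(*) FROM contacts "
        "GROUP BY email HAVING COUNT(*) > 1 ORDER BY email").fetchall()
```

    Neither query needs an extra field to keep track of which rows are duplicates.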

    If this is not a one-time situation though, I might consider revisiting your data management plans...

    ------------------------------
    Janice Chan
    Co-Organizer, NTEN Nonprofits and Data group
    Twitter: @curiositybone

    Consultant, Shift and Scaffold
    www.shiftandscaffold.com
    ------------------------------



  • 10.  RE: Tool for large email list (1 million+) deduplication

    Posted May 02, 2019 17:30
    I echo this.
    I also will eschew Excel for ANY data operations like this. It is just not a competent tool.
    If you can't read or write a CSV, then you're no use to me...

    Before I did the below, I'd do what Janice recommended. You have a really good tool at your disposal, before you try to do any heavy lifting.

    If your email system doesn't meet your needs, then:

    I'd use any decent text editor to do this job.
    Open your text file in Notepad++ and sort it by email address.
    From Edit -> Line Operations, choose Remove Consecutive Duplicate Lines.

    This should clean up your file for you.
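
    The same sort-then-remove-consecutive-duplicates idea can be sketched in Python for anyone who prefers a script. This is an illustration only, with made-up function names, and it assumes case and surrounding spaces should be ignored. The sort matters because it puts identical addresses next to each other; without it, repeats that are far apart in the file would survive.

```python
def remove_consecutive_duplicates(lines):
    """Drop a line only when it equals the line directly before it."""
    result = []
    previous = None
    for line in lines:
        if line != previous:
            result.append(line)
        previous = line
    return result


def sort_and_dedupe(lines):
    """Normalize, sort so duplicates become adjacent, then drop the repeats."""
    cleaned = sorted(line.strip().lower() for line in lines if line.strip())
    return remove_consecutive_duplicates(cleaned)
```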

    HTH

    ------------------------------
    David Buttrick
    Application Architect
    Parents as Teachers
    St. Louis, MO
    ------------------------------



  • 11.  RE: Tool for large email list (1 million+) deduplication

    Posted May 03, 2019 10:37
    Hi David,

    Thanks for the tip on Notepad++!

    ------------------------------
    Denise Cummings
    Data Systems Administrator
    Friends of the Earth
    Washington, DC
    ------------------------------



  • 12.  RE: Tool for large email list (1 million+) deduplication

    Posted May 03, 2019 10:36
    Hi Janice,

    We use Engaging Networks, which doesn't allow duplicate emails, so that's not really the issue. To give more context, I'm exporting a list of emails out of Engaging Networks (1 million+) and want to compare it to a list from an outside organization for a joint email send. We want to deduplicate to remove any emails that appear on both lists and send the cleaned list back to our partner org. In this particular case, they're sending the email. This is only an issue when working outside of our system. We ended up having one of our consultants do it for us, and since this type of thing comes up regularly, we'll probably continue to do that.
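
    For reference, comparing two lists like this comes down to a set lookup. Below is a minimal Python sketch; which list gets cleaned and which acts as the exclusion list is an assumption here, and the function and parameter names are made up. It keeps every address from the first list that does not appear in the second, ignoring case and surrounding spaces.

```python
def normalize(email):
    """Lowercase and trim an address so near-identical entries match."""
    return email.strip().lower()


def remove_overlap(list_to_clean, list_to_exclude):
    """Return addresses from list_to_clean that are absent from list_to_exclude.

    Original order is kept, and repeats within list_to_clean are dropped too.
    """
    excluded = {normalize(e) for e in list_to_exclude if e.strip()}
    seen = set()
    cleaned = []
    for raw in list_to_clean:
        email = normalize(raw)
        if email and email not in excluded and email not in seen:
            seen.add(email)
            cleaned.append(email)
    return cleaned
```

    Swapping the two arguments gives the cleaned version of the other list.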


    ------------------------------
    Denise Cummings
    Data Systems Administrator
    Friends of the Earth
    Washington, DC
    ------------------------------
